Project: Identify Customer Segments

In this project, you will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

This notebook will help you complete this task by providing a framework within which you will perform your analysis steps. In each step of the project, you will see some text describing the subtask that you will perform, followed by one or more code cells for you to complete your work. Feel free to add additional code and markdown cells as you go along so that you can explore everything in precise chunks. The code cells provided in the base template will outline only the major tasks, and will usually not be enough to cover all of the minor tasks that comprise it.

It should be noted that while there will be precise guidelines on how you should handle certain tasks in the project, there will also be places where an exact specification is not provided. There will be times in the project where you will need to make and justify your own decisions on how to treat the data. These are places where there may not be only one way to handle the data. In real-life tasks, there may be many valid ways to approach an analysis task. One of the most important things you can do is clearly document your approach so that other scientists can understand the decisions you've made.

At the end of most sections, there will be a Markdown cell labeled Discussion. In these cells, you will report your findings for the completed section, as well as document the decisions that you made in your approach to each subtask. Your project will be evaluated not just on the code used to complete the tasks outlined, but also your communication about your observations and conclusions at each stage.

In [1]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# magic word for producing visualizations in notebook
%matplotlib inline

'''
Import note: The classroom currently uses sklearn version 0.19.
If you need to use an imputer, it is available in sklearn.preprocessing.Imputer,
instead of sklearn.impute as in newer versions of sklearn.
'''
Out[1]:
'\nImport note: The classroom currently uses sklearn version 0.19.\nIf you need to use an imputer, it is available in sklearn.preprocessing.Imputer,\ninstead of sklearn.impute as in newer versions of sklearn.\n'

Step 0: Load the Data

There are four files associated with this project (not including this one):

  • Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
  • Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
  • Data_Dictionary.md: Detailed information file about the features in the provided datasets.
  • AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. You will use this information to cluster the general population into groups with similar demographic properties. Then, you will see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.

To start off with, load in the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note for all of the .csv data files in this project: they're semicolon (;) delimited, so you'll need an additional argument in your read_csv() call to read in the data properly. Also, considering the size of the main dataset, it may take some time for it to load completely.

Once the dataset is loaded, it's recommended that you take a little bit of time just browsing the general structure of the dataset and feature summary file. You'll be getting deep into the innards of the cleaning in the first major step of the project, so gaining some general familiarity can help you get your bearings.

In [2]:
# Load in the general demographics data.
azdias = pd.read_csv('Udacity_AZDIAS_Subset.csv', sep=';')

# Load in the feature summary file.
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv', sep=';')
In [3]:
# Check the structure of the data after it's loaded (e.g. print the number of
# rows and columns, print the first few rows).

# I'm checking the data in the 2 cells below
azdias.index[0]
Out[3]:
0
In [4]:
feat_info
Out[4]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
5 FINANZ_SPARER person ordinal [-1]
6 FINANZ_VORSORGER person ordinal [-1]
7 FINANZ_ANLEGER person ordinal [-1]
8 FINANZ_UNAUFFAELLIGER person ordinal [-1]
9 FINANZ_HAUSBAUER person ordinal [-1]
10 FINANZTYP person categorical [-1]
11 GEBURTSJAHR person numeric [0]
12 GFK_URLAUBERTYP person categorical []
13 GREEN_AVANTGARDE person categorical []
14 HEALTH_TYP person ordinal [-1,0]
15 LP_LEBENSPHASE_FEIN person mixed [0]
16 LP_LEBENSPHASE_GROB person mixed [0]
17 LP_FAMILIE_FEIN person categorical [0]
18 LP_FAMILIE_GROB person categorical [0]
19 LP_STATUS_FEIN person categorical [0]
20 LP_STATUS_GROB person categorical [0]
21 NATIONALITAET_KZ person categorical [-1,0]
22 PRAEGENDE_JUGENDJAHRE person mixed [-1,0]
23 RETOURTYP_BK_S person ordinal [0]
24 SEMIO_SOZ person ordinal [-1,9]
25 SEMIO_FAM person ordinal [-1,9]
26 SEMIO_REL person ordinal [-1,9]
27 SEMIO_MAT person ordinal [-1,9]
28 SEMIO_VERT person ordinal [-1,9]
29 SEMIO_LUST person ordinal [-1,9]
... ... ... ... ...
55 OST_WEST_KZ building categorical [-1]
56 WOHNLAGE building mixed [-1]
57 CAMEO_DEUG_2015 microcell_rr4 categorical [-1,X]
58 CAMEO_DEU_2015 microcell_rr4 categorical [XX]
59 CAMEO_INTL_2015 microcell_rr4 mixed [-1,XX]
60 KBA05_ANTG1 microcell_rr3 ordinal [-1]
61 KBA05_ANTG2 microcell_rr3 ordinal [-1]
62 KBA05_ANTG3 microcell_rr3 ordinal [-1]
63 KBA05_ANTG4 microcell_rr3 ordinal [-1]
64 KBA05_BAUMAX microcell_rr3 mixed [-1,0]
65 KBA05_GBZ microcell_rr3 ordinal [-1,0]
66 BALLRAUM postcode ordinal [-1]
67 EWDICHTE postcode ordinal [-1]
68 INNENSTADT postcode ordinal [-1]
69 GEBAEUDETYP_RASTER region_rr1 ordinal []
70 KKK region_rr1 ordinal [-1,0]
71 MOBI_REGIO region_rr1 ordinal []
72 ONLINE_AFFINITAET region_rr1 ordinal []
73 REGIOTYP region_rr1 ordinal [-1,0]
74 KBA13_ANZAHL_PKW macrocell_plz8 numeric []
75 PLZ8_ANTG1 macrocell_plz8 ordinal [-1]
76 PLZ8_ANTG2 macrocell_plz8 ordinal [-1]
77 PLZ8_ANTG3 macrocell_plz8 ordinal [-1]
78 PLZ8_ANTG4 macrocell_plz8 ordinal [-1]
79 PLZ8_BAUMAX macrocell_plz8 mixed [-1,0]
80 PLZ8_HHZ macrocell_plz8 ordinal [-1]
81 PLZ8_GBZ macrocell_plz8 ordinal [-1]
82 ARBEIT community ordinal [-1,9]
83 ORTSGR_KLS9 community ordinal [-1,0]
84 RELAT_AB community ordinal [-1,9]

85 rows × 4 columns

In [5]:
# I'm checking the heading in the 2 cells below
azdias.head()
Out[5]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 -1 2 1 2.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -1 1 2 5.0 1 5 2 5 4 5 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 -1 3 2 3.0 1 4 1 2 3 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2 4 2 2.0 4 2 5 2 1 2 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 -1 3 1 5.0 4 3 4 1 3 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0

5 rows × 85 columns

In [6]:
feat_info.head()
Out[6]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
In [7]:
# I'm checking the .describe() function in the 2 cells below
azdias.describe()
Out[7]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
count 891221.000000 891221.000000 891221.000000 886367.000000 891221.000000 891221.000000 891221.000000 891221.000000 891221.000000 891221.000000 ... 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 794005.000000 794005.000000 794005.00000
mean -0.358435 2.777398 1.522098 3.632838 3.074528 2.821039 3.401106 3.033328 2.874167 3.075121 ... 2.253330 2.801858 1.595426 0.699166 1.943913 3.612821 3.381087 3.167854 5.293002 3.07222
std 1.198724 1.068775 0.499512 1.595021 1.321055 1.464749 1.322134 1.529603 1.486731 1.353248 ... 0.972008 0.920309 0.986736 0.727137 1.459654 0.973967 1.111598 1.002376 2.303739 1.36298
min -1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.00000
25% -1.000000 2.000000 1.000000 2.000000 2.000000 1.000000 3.000000 2.000000 2.000000 2.000000 ... 1.000000 2.000000 1.000000 0.000000 1.000000 3.000000 3.000000 3.000000 4.000000 2.00000
50% -1.000000 3.000000 2.000000 4.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 ... 2.000000 3.000000 2.000000 1.000000 1.000000 4.000000 3.000000 3.000000 5.000000 3.00000
75% -1.000000 4.000000 2.000000 5.000000 4.000000 4.000000 5.000000 5.000000 4.000000 4.000000 ... 3.000000 3.000000 2.000000 1.000000 3.000000 4.000000 4.000000 4.000000 7.000000 4.00000
max 3.000000 9.000000 2.000000 6.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 ... 4.000000 4.000000 3.000000 2.000000 5.000000 5.000000 5.000000 9.000000 9.000000 9.00000

8 rows × 81 columns

In [8]:
feat_info.describe()
Out[8]:
attribute information_level type missing_or_unknown
count 85 85 85 85
unique 85 9 5 9
top SEMIO_KAEM person ordinal [-1]
freq 1 43 49 26
In [9]:
# plot the data here later?
# plt.plot() ??

Tip: Add additional cells to keep everything in reasonably-sized chunks! Keyboard shortcut esc --> a (press escape to enter command mode, then press the 'A' key) adds a new cell before the active cell, and esc --> b adds a new cell after the active cell. If you need to convert an active cell to a markdown cell, use esc --> m and to convert to a code cell, use esc --> y.

Step 1: Preprocessing

Step 1.1: Assess Missing Data

The feature summary file contains a summary of properties for each demographics data column. You will use this file to help you make cleaning decisions during this stage of the project. First of all, you should assess the demographics data in terms of missing data. Pay attention to the following points as you perform your analysis, and take notes on what you observe. Make sure that you fill in the Discussion cell with your findings and decisions at the end of each step that has one!

Step 1.1.1: Convert Missing Value Codes to NaNs

The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), this will get read in as a string object. You'll need to do a little bit of parsing to make use of it to identify and clean the data. Convert data that matches a 'missing' or 'unknown' value code into a numpy NaN value. You might want to see how much data takes on a 'missing' or 'unknown' code, and how much data is naturally missing, as a point of interest.

As one more reminder, you are encouraged to add additional cells to break up your analysis into manageable chunks.

In [10]:
#I was trying to figure out what kind of data is missing below, not important.
# for i in azdias['CAMEO_DEU_2015']:
#     if i == "XX":
#         print(i)
# print("DONE")
# np.argwhere(np.isnan(feat_info_np))
In [11]:
def convert(word):
    temp = word.replace('[', '')
    temp = temp.replace(']', '')
#     temp = temp.replace(',', '')
    temp = temp.split(",")
    return temp
In [12]:
# Identify missing or unknown data values and convert them to NaNs.

# Moving through the whole missing_or_unknown column

for i in range(len(feat_info['missing_or_unknown'])):
#     print(i)
#     print(feat_info['missing_or_unknown'][i])
    # convert(word) basically removes the brackets and returns a list of items split by ",". Saved the list in converted
    converted = convert(feat_info['missing_or_unknown'][i])
    print(converted)
    
    # Getting the column name to use in azdias DataFrame
        # For each value I want to replace, given from "converted" above
    for value in converted:
        try:
            value = int(value)
        except:
            None

        # Replace the value to NaN
        azdias[feat_info['attribute'][i]] = azdias[feat_info['attribute'][i]].replace(value, np.NaN)
        
print("DONE")
['-1', '0']
['-1', '0', '9']
['-1', '0']
['0']
['-1']
['-1']
['-1']
['-1']
['-1']
['-1']
['-1']
['0']
['']
['']
['-1', '0']
['0']
['0']
['0']
['0']
['0']
['0']
['-1', '0']
['-1', '0']
['0']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1', '9']
['-1']
['-1']
['-1', '0']
['-1']
['-1', '9']
['0']
['']
['']
['-1', '0']
['-1']
['-1', '0']
['-1', '0']
['0']
['']
['-1', '0']
['']
['0']
['-1']
['-1']
['-1', 'X']
['XX']
['-1', 'XX']
['-1']
['-1']
['-1']
['-1']
['-1', '0']
['-1', '0']
['-1']
['-1']
['-1']
['']
['-1', '0']
['']
['']
['-1', '0']
['']
['-1']
['-1']
['-1']
['-1']
['-1', '0']
['-1']
['-1']
['-1', '9']
['-1', '0']
['-1', '9']
DONE
In [13]:
# Perform an assessment of how much missing data there is in each column of the
# dataset.

# Total amount of NaN values

total = 0
dic_attr_NaN = {}

for c in azdias.columns:

    n_NaN = azdias.loc[azdias[c].astype(str) == "nan"][c].isna().count()
    total += n_NaN
    dic_attr_NaN[c] = n_NaN
    print(f"In the column {c}, there's {n_NaN} NaN values.\n")
print(f"Total number of NaN values is {total}")
print(dic_attr_NaN)
In the column AGER_TYP, there's 685843 NaN values.

In the column ALTERSKATEGORIE_GROB, there's 2881 NaN values.

In the column ANREDE_KZ, there's 0 NaN values.

In the column CJT_GESAMTTYP, there's 4854 NaN values.

In the column FINANZ_MINIMALIST, there's 0 NaN values.

In the column FINANZ_SPARER, there's 0 NaN values.

In the column FINANZ_VORSORGER, there's 0 NaN values.

In the column FINANZ_ANLEGER, there's 0 NaN values.

In the column FINANZ_UNAUFFAELLIGER, there's 0 NaN values.

In the column FINANZ_HAUSBAUER, there's 0 NaN values.

In the column FINANZTYP, there's 0 NaN values.

In the column GEBURTSJAHR, there's 392318 NaN values.

In the column GFK_URLAUBERTYP, there's 4854 NaN values.

In the column GREEN_AVANTGARDE, there's 0 NaN values.

In the column HEALTH_TYP, there's 111196 NaN values.

In the column LP_LEBENSPHASE_FEIN, there's 97632 NaN values.

In the column LP_LEBENSPHASE_GROB, there's 94572 NaN values.

In the column LP_FAMILIE_FEIN, there's 77792 NaN values.

In the column LP_FAMILIE_GROB, there's 77792 NaN values.

In the column LP_STATUS_FEIN, there's 4854 NaN values.

In the column LP_STATUS_GROB, there's 4854 NaN values.

In the column NATIONALITAET_KZ, there's 108315 NaN values.

In the column PRAEGENDE_JUGENDJAHRE, there's 108164 NaN values.

In the column RETOURTYP_BK_S, there's 4854 NaN values.

In the column SEMIO_SOZ, there's 0 NaN values.

In the column SEMIO_FAM, there's 0 NaN values.

In the column SEMIO_REL, there's 0 NaN values.

In the column SEMIO_MAT, there's 0 NaN values.

In the column SEMIO_VERT, there's 0 NaN values.

In the column SEMIO_LUST, there's 0 NaN values.

In the column SEMIO_ERL, there's 0 NaN values.

In the column SEMIO_KULT, there's 0 NaN values.

In the column SEMIO_RAT, there's 0 NaN values.

In the column SEMIO_KRIT, there's 0 NaN values.

In the column SEMIO_DOM, there's 0 NaN values.

In the column SEMIO_KAEM, there's 0 NaN values.

In the column SEMIO_PFLICHT, there's 0 NaN values.

In the column SEMIO_TRADV, there's 0 NaN values.

In the column SHOPPER_TYP, there's 111196 NaN values.

In the column SOHO_KZ, there's 73499 NaN values.

In the column TITEL_KZ, there's 889061 NaN values.

In the column VERS_TYP, there's 111196 NaN values.

In the column ZABEOTYP, there's 0 NaN values.

In the column ALTER_HH, there's 310267 NaN values.

In the column ANZ_PERSONEN, there's 73499 NaN values.

In the column ANZ_TITEL, there's 73499 NaN values.

In the column HH_EINKOMMEN_SCORE, there's 18348 NaN values.

In the column KK_KUNDENTYP, there's 584612 NaN values.

In the column W_KEIT_KIND_HH, there's 147988 NaN values.

In the column WOHNDAUER_2008, there's 73499 NaN values.

In the column ANZ_HAUSHALTE_AKTIV, there's 99611 NaN values.

In the column ANZ_HH_TITEL, there's 97008 NaN values.

In the column GEBAEUDETYP, there's 93148 NaN values.

In the column KONSUMNAEHE, there's 73969 NaN values.

In the column MIN_GEBAEUDEJAHR, there's 93148 NaN values.

In the column OST_WEST_KZ, there's 93148 NaN values.

In the column WOHNLAGE, there's 93148 NaN values.

In the column CAMEO_DEUG_2015, there's 99352 NaN values.

In the column CAMEO_DEU_2015, there's 99352 NaN values.

In the column CAMEO_INTL_2015, there's 99352 NaN values.

In the column KBA05_ANTG1, there's 133324 NaN values.

In the column KBA05_ANTG2, there's 133324 NaN values.

In the column KBA05_ANTG3, there's 133324 NaN values.

In the column KBA05_ANTG4, there's 133324 NaN values.

In the column KBA05_BAUMAX, there's 476524 NaN values.

In the column KBA05_GBZ, there's 133324 NaN values.

In the column BALLRAUM, there's 93740 NaN values.

In the column EWDICHTE, there's 93740 NaN values.

In the column INNENSTADT, there's 93740 NaN values.

In the column GEBAEUDETYP_RASTER, there's 93155 NaN values.

In the column KKK, there's 158064 NaN values.

In the column MOBI_REGIO, there's 133324 NaN values.

In the column ONLINE_AFFINITAET, there's 4854 NaN values.

In the column REGIOTYP, there's 158064 NaN values.

In the column KBA13_ANZAHL_PKW, there's 105800 NaN values.

In the column PLZ8_ANTG1, there's 116515 NaN values.

In the column PLZ8_ANTG2, there's 116515 NaN values.

In the column PLZ8_ANTG3, there's 116515 NaN values.

In the column PLZ8_ANTG4, there's 116515 NaN values.

In the column PLZ8_BAUMAX, there's 116515 NaN values.

In the column PLZ8_HHZ, there's 116515 NaN values.

In the column PLZ8_GBZ, there's 116515 NaN values.

In the column ARBEIT, there's 97375 NaN values.

In the column ORTSGR_KLS9, there's 97274 NaN values.

In the column RELAT_AB, there's 97375 NaN values.

Total number of NaN values is 8373929
{'AGER_TYP': 685843, 'ALTERSKATEGORIE_GROB': 2881, 'ANREDE_KZ': 0, 'CJT_GESAMTTYP': 4854, 'FINANZ_MINIMALIST': 0, 'FINANZ_SPARER': 0, 'FINANZ_VORSORGER': 0, 'FINANZ_ANLEGER': 0, 'FINANZ_UNAUFFAELLIGER': 0, 'FINANZ_HAUSBAUER': 0, 'FINANZTYP': 0, 'GEBURTSJAHR': 392318, 'GFK_URLAUBERTYP': 4854, 'GREEN_AVANTGARDE': 0, 'HEALTH_TYP': 111196, 'LP_LEBENSPHASE_FEIN': 97632, 'LP_LEBENSPHASE_GROB': 94572, 'LP_FAMILIE_FEIN': 77792, 'LP_FAMILIE_GROB': 77792, 'LP_STATUS_FEIN': 4854, 'LP_STATUS_GROB': 4854, 'NATIONALITAET_KZ': 108315, 'PRAEGENDE_JUGENDJAHRE': 108164, 'RETOURTYP_BK_S': 4854, 'SEMIO_SOZ': 0, 'SEMIO_FAM': 0, 'SEMIO_REL': 0, 'SEMIO_MAT': 0, 'SEMIO_VERT': 0, 'SEMIO_LUST': 0, 'SEMIO_ERL': 0, 'SEMIO_KULT': 0, 'SEMIO_RAT': 0, 'SEMIO_KRIT': 0, 'SEMIO_DOM': 0, 'SEMIO_KAEM': 0, 'SEMIO_PFLICHT': 0, 'SEMIO_TRADV': 0, 'SHOPPER_TYP': 111196, 'SOHO_KZ': 73499, 'TITEL_KZ': 889061, 'VERS_TYP': 111196, 'ZABEOTYP': 0, 'ALTER_HH': 310267, 'ANZ_PERSONEN': 73499, 'ANZ_TITEL': 73499, 'HH_EINKOMMEN_SCORE': 18348, 'KK_KUNDENTYP': 584612, 'W_KEIT_KIND_HH': 147988, 'WOHNDAUER_2008': 73499, 'ANZ_HAUSHALTE_AKTIV': 99611, 'ANZ_HH_TITEL': 97008, 'GEBAEUDETYP': 93148, 'KONSUMNAEHE': 73969, 'MIN_GEBAEUDEJAHR': 93148, 'OST_WEST_KZ': 93148, 'WOHNLAGE': 93148, 'CAMEO_DEUG_2015': 99352, 'CAMEO_DEU_2015': 99352, 'CAMEO_INTL_2015': 99352, 'KBA05_ANTG1': 133324, 'KBA05_ANTG2': 133324, 'KBA05_ANTG3': 133324, 'KBA05_ANTG4': 133324, 'KBA05_BAUMAX': 476524, 'KBA05_GBZ': 133324, 'BALLRAUM': 93740, 'EWDICHTE': 93740, 'INNENSTADT': 93740, 'GEBAEUDETYP_RASTER': 93155, 'KKK': 158064, 'MOBI_REGIO': 133324, 'ONLINE_AFFINITAET': 4854, 'REGIOTYP': 158064, 'KBA13_ANZAHL_PKW': 105800, 'PLZ8_ANTG1': 116515, 'PLZ8_ANTG2': 116515, 'PLZ8_ANTG3': 116515, 'PLZ8_ANTG4': 116515, 'PLZ8_BAUMAX': 116515, 'PLZ8_HHZ': 116515, 'PLZ8_GBZ': 116515, 'ARBEIT': 97375, 'ORTSGR_KLS9': 97274, 'RELAT_AB': 97375}
In [14]:
# Checking in a column to make sure convertion worked
for i in azdias['CAMEO_DEU_2015']:
    if i == "XX":
        print(i)
        # Nothing printed, so it worked, value XX is converted to NaN.

Step 1.1.2: Assess Missing Data in Each Column

How much missing data is present in each column? There are a few columns that are outliers in terms of the proportion of values that are missing. You will want to use matplotlib's hist() function to visualize the distribution of missing value counts to find these columns. Identify and document these columns. While some of these columns might have justifications for keeping or re-encoding the data, for this project you should just remove them from the dataframe. (Feel free to make remarks about these outlier columns in the discussion, however!)

For the remaining features, are there any patterns in which columns have, or share, missing data?

In [15]:
for column in feat_info['attribute']:
    print(column)
    # .astype(str) to be able to draw np.NaN and numeric,string values together
    plt.hist(azdias[column].astype(str))
    plt.show()
print("DONE")
AGER_TYP
ALTERSKATEGORIE_GROB
ANREDE_KZ
CJT_GESAMTTYP
FINANZ_MINIMALIST
FINANZ_SPARER
FINANZ_VORSORGER
FINANZ_ANLEGER
FINANZ_UNAUFFAELLIGER
FINANZ_HAUSBAUER
FINANZTYP
GEBURTSJAHR
GFK_URLAUBERTYP
GREEN_AVANTGARDE
HEALTH_TYP
LP_LEBENSPHASE_FEIN
LP_LEBENSPHASE_GROB
LP_FAMILIE_FEIN
LP_FAMILIE_GROB
LP_STATUS_FEIN
LP_STATUS_GROB
NATIONALITAET_KZ
PRAEGENDE_JUGENDJAHRE
RETOURTYP_BK_S
SEMIO_SOZ
SEMIO_FAM
SEMIO_REL
SEMIO_MAT
SEMIO_VERT
SEMIO_LUST
SEMIO_ERL
SEMIO_KULT
SEMIO_RAT
SEMIO_KRIT
SEMIO_DOM
SEMIO_KAEM
SEMIO_PFLICHT
SEMIO_TRADV
SHOPPER_TYP
SOHO_KZ
TITEL_KZ
VERS_TYP
ZABEOTYP
ALTER_HH
ANZ_PERSONEN
ANZ_TITEL
HH_EINKOMMEN_SCORE
KK_KUNDENTYP
W_KEIT_KIND_HH
WOHNDAUER_2008
ANZ_HAUSHALTE_AKTIV
ANZ_HH_TITEL
GEBAEUDETYP
KONSUMNAEHE
MIN_GEBAEUDEJAHR
OST_WEST_KZ
WOHNLAGE
CAMEO_DEUG_2015
CAMEO_DEU_2015
CAMEO_INTL_2015
KBA05_ANTG1
KBA05_ANTG2
KBA05_ANTG3
KBA05_ANTG4
KBA05_BAUMAX
KBA05_GBZ
BALLRAUM
EWDICHTE
INNENSTADT
GEBAEUDETYP_RASTER
KKK
MOBI_REGIO
ONLINE_AFFINITAET
REGIOTYP
KBA13_ANZAHL_PKW
PLZ8_ANTG1
PLZ8_ANTG2
PLZ8_ANTG3
PLZ8_ANTG4
PLZ8_BAUMAX
PLZ8_HHZ
PLZ8_GBZ
ARBEIT
ORTSGR_KLS9
RELAT_AB
DONE
In [16]:
# Investigate patterns in the amount of missing data in each column.

# Few cells ago, I saved NaN values in a dictionary, these are the 6 columns with most NaN values
most_NaN = dict(sorted(dic_attr_NaN.items(), key=lambda item: item[1], reverse=True)[:6])
most_NaN
Out[16]:
{'TITEL_KZ': 889061,
 'AGER_TYP': 685843,
 'KK_KUNDENTYP': 584612,
 'KBA05_BAUMAX': 476524,
 'GEBURTSJAHR': 392318,
 'ALTER_HH': 310267}
In [17]:
from matplotlib.pyplot import figure

figure(figsize=(15, 30), dpi=80)
# Below is a graph to show each column and the amount of NaN values it has, we can clearly see TITEL_KZ, AGER_TYP,
# KK_KUNDENTYP, KBA05_BAUMAX, GEBURTSJAHR, and ALTER_HH having way more NaN values compared to others, I deleted the highest 6 columns
plt.scatter(dic_attr_NaN.values(), dic_attr_NaN.keys())
Out[17]:
<matplotlib.collections.PathCollection at 0x7fe38f7096d8>
In [18]:
# Remove the outlier columns from the dataset. (You'll perform other data
# engineering tasks such as re-encoding and imputation later.)

old_azdias = azdias
for i in most_NaN.keys():
    azdias = azdias.drop([i], axis=1)
In [19]:
# Now the columns are gone in azdias_reduced
print(len(old_azdias.columns.values))
print(len(azdias.columns.values))
85
79

Discussion 1.1.2: Assess Missing Data in Each Column

(Double click this cell and replace this text with your own text, reporting your observations regarding the amount of missing data in each column. Are there any patterns in missing values? Which columns were removed from the dataset?)

Answer:

I notice there's a lot of missing data. TITEL_KZ, AGER_TYP, KK_KUNDENTYP, KBA05_BAUMAX, GEBURTSJAHR, and ALTER_HH have so many NaN values so I dropped them

Step 1.1.3: Assess Missing Data in Each Row

Now, you'll perform a similar assessment for the rows of the dataset. How much data is missing in each row? As with the columns, you should see some groups of points that have a very different numbers of missing values. Divide the data into two subsets: one for data points that are above some threshold for missing values, and a second subset for points below that threshold.

In order to know what to do with the outlier rows, we should see if the distribution of data values on columns that are not missing data (or are missing very little data) are similar or different between the two groups. Select at least five of these columns and compare the distribution of values.

  • You can use seaborn's countplot() function to create a bar chart of code frequencies and matplotlib's subplot() function to put bar charts for the two subplots side by side.
  • To reduce repeated code, you might want to write a function that can perform this comparison, taking as one of its arguments a column to be compared.

Depending on what you observe in your comparison, this will have implications on how you approach your conclusions later in the analysis. If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should make a note on those data as special. We'll revisit these data later on. Either way, you should continue your analysis for now using just the subset of the data with few or no missing values.

In [20]:
# Old tries, not important

# nan_values = azdias_reduced[azdias_reduced.isna().any(axis=1)]
# nan_values
# NaN_appeared = {}
# NaN_missing = {}

# azdias_reduced=azdias    

# temp = azdias.applymap(lambda x: str(x) == "nan")

# temp
# print("START")
# for i in azdias.index:
#     appeared = False
#     for j in azdias:
#         if str(azdias[j][i]) == "nan":
#             NaN_appeared.append(azdias.loc[:i])
#             appeared = True
#     if not appeared:
#         NaN_missing.append(azdias.loc[:i])
#     if i%10000 == 0:
#         print(i)
# print("END")

# def col_iter(y):
#     if str(azdias[y]) == "nan":
#         return True
#     else:
#         return False

# def row_iter(x):
#     appeared = False
#     for j in azdias:
#         if str(azdias[j][x]) == "nan":
#             NaN_appeared[x] = azdias.loc[:x]
#             appeared = True
#         if not appeared:
#             NaN_missing[x] = azdias.loc[:x]
#     if x%100 == 0:
#         print(x)   
#     print("END")
#     appeared = False
#     col_vec = np.vectorize(row_iter)
#     col_vec(azdias.iloc[:x])
# row_vec = np.vectorize(row_iter)
# row_vec(azdias.index)
In [21]:
dd = azdias["VERS_TYP"].astype(str)
dd
Out[21]:
0         nan
1         2.0
2         1.0
3         1.0
4         2.0
5         2.0
6         1.0
7         1.0
8         2.0
9         2.0
10        2.0
11        nan
12        1.0
13        2.0
14        nan
15        2.0
16        2.0
17        nan
18        1.0
19        2.0
20        1.0
21        2.0
22        2.0
23        2.0
24        nan
25        2.0
26        1.0
27        2.0
28        2.0
29        2.0
         ... 
891191    2.0
891192    1.0
891193    1.0
891194    2.0
891195    2.0
891196    1.0
891197    2.0
891198    2.0
891199    1.0
891200    2.0
891201    2.0
891202    1.0
891203    1.0
891204    2.0
891205    1.0
891206    2.0
891207    1.0
891208    2.0
891209    2.0
891210    2.0
891211    1.0
891212    1.0
891213    1.0
891214    2.0
891215    1.0
891216    1.0
891217    nan
891218    1.0
891219    2.0
891220    1.0
Name: VERS_TYP, Length: 891221, dtype: object
In [22]:
# I've chosen 10 columns to compare with different thresh, ones with good amount of NaN values.
# KBA05_ANTG1, KBA05_ANTG2, KBA05_ANTG3, KBA05_ANTG4, ONLINE_AFFINITAET, KBA05_GBZ, BALLRAUM, EWDICHTE, INNENSTADT, KKK

chosen_c = ["KBA05_ANTG1", "KBA05_ANTG2", "KBA05_ANTG3", "KBA05_ANTG4", "ONLINE_AFFINITAET",
            "KBA05_GBZ", "BALLRAUM", "EWDICHTE", "INNENSTADT", "KKK"]


thr = [10,20,30,40,50]


for t in thr:
    plt.figure(figsize = [30, 10]) 
    print("Cycle started")
    temp_data = azdias.dropna(thresh=t)
    i=1
    for c in chosen_c:
        temp_data[c] = temp_data[c].astype(str)
        
        plt.subplot(1, 10, i) 
        i+=1
        sns.countplot(data=temp_data, x=c)
        print(f"Column {c} Done")
    plt.show()
    print(f"Thresh {t} Done")
Cycle started
Column KBA05_ANTG1 Done
Column KBA05_ANTG2 Done
Column KBA05_ANTG3 Done
Column KBA05_ANTG4 Done
Column ONLINE_AFFINITAET Done
Column KBA05_GBZ Done
Column BALLRAUM Done
Column EWDICHTE Done
Column INNENSTADT Done
Column KKK Done
Thresh 10 Done
Cycle started
Column KBA05_ANTG1 Done
Column KBA05_ANTG2 Done
Column KBA05_ANTG3 Done
Column KBA05_ANTG4 Done
Column ONLINE_AFFINITAET Done
Column KBA05_GBZ Done
Column BALLRAUM Done
Column EWDICHTE Done
Column INNENSTADT Done
Column KKK Done
Thresh 20 Done
Cycle started
Column KBA05_ANTG1 Done
Column KBA05_ANTG2 Done
Column KBA05_ANTG3 Done
Column KBA05_ANTG4 Done
Column ONLINE_AFFINITAET Done
Column KBA05_GBZ Done
Column BALLRAUM Done
Column EWDICHTE Done
Column INNENSTADT Done
Column KKK Done
Thresh 30 Done
Cycle started
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:17: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Column KBA05_ANTG1 Done
Column KBA05_ANTG2 Done
Column KBA05_ANTG3 Done
Column KBA05_ANTG4 Done
Column ONLINE_AFFINITAET Done
Column KBA05_GBZ Done
Column BALLRAUM Done
Column EWDICHTE Done
Column INNENSTADT Done
Column KKK Done
Thresh 40 Done
Cycle started
Column KBA05_ANTG1 Done
Column KBA05_ANTG2 Done
Column KBA05_ANTG3 Done
Column KBA05_ANTG4 Done
Column ONLINE_AFFINITAET Done
Column KBA05_GBZ Done
Column BALLRAUM Done
Column EWDICHTE Done
Column INNENSTADT Done
Column KKK Done
Thresh 50 Done
In [23]:
# How much data is missing in each row of the dataset?


NaN_appeared = azdias[azdias.isna().any(axis=1)]

NaN_missing = azdias[azdias.isna().any(axis=1) == False]

# Here's rows with NaN values
NaN_appeared
Out[23]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2.0 1 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4.0 2 2.0 4 2 5 2 1 2 6 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
11 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 3.0 1 6.0 5 3 4 2 4 1 3 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 3.0 6.0 4.0
13 1.0 2 5.0 1 4 3 5 5 2 1 ... 2.0 1.0 1.0 1.0 1.0 3.0 3.0 3.0 6.0 4.0
14 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 4.0 2 4.0 4 1 5 1 1 4 2 ... NaN NaN NaN NaN NaN NaN NaN 4.0 8.0 5.0
16 1.0 2 1.0 4 3 1 4 5 1 3 ... 3.0 3.0 1.0 0.0 1.0 3.0 4.0 1.0 2.0 1.0
17 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 2.0 2 6.0 2 4 1 5 4 1 1 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 4.0 3.0
20 2.0 2 4.0 4 3 1 4 5 1 3 ... NaN NaN NaN NaN NaN NaN NaN 3.0 4.0 1.0
23 3.0 1 3.0 5 3 3 2 2 1 6 ... NaN NaN NaN NaN NaN NaN NaN 3.0 6.0 2.0
24 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 1.0 1 3.0 3 5 3 5 4 3 4 ... 2.0 2.0 0.0 0.0 1.0 4.0 5.0 3.0 2.0 5.0
26 3.0 1 3.0 5 2 4 2 3 1 3 ... NaN NaN NaN NaN NaN NaN NaN 4.0 3.0 5.0
29 4.0 2 1.0 5 1 5 2 1 3 6 ... 3.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 4.0 3.0
30 3.0 2 3.0 4 3 4 4 4 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
34 1.0 2 2.0 4 1 5 1 2 3 2 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 4.0 7.0 5.0
35 2.0 2 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
39 4.0 1 4.0 5 3 4 3 1 1 6 ... 2.0 3.0 2.0 1.0 1.0 4.0 4.0 3.0 5.0 3.0
40 4.0 2 1.0 3 2 5 2 1 5 6 ... 3.0 2.0 1.0 0.0 1.0 2.0 3.0 3.0 5.0 3.0
42 1.0 2 5.0 3 4 2 5 4 2 3 ... 4.0 3.0 1.0 0.0 1.0 2.0 3.0 3.0 3.0 1.0
46 NaN 2 3.0 2 4 3 5 5 4 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
48 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
53 2.0 1 3.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
54 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
60 2.0 1 6.0 3 2 5 1 1 5 5 ... 1.0 4.0 3.0 2.0 4.0 4.0 2.0 3.0 7.0 2.0
61 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62 3.0 1 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
64 2.0 2 6.0 1 4 2 4 4 4 1 ... 3.0 3.0 2.0 0.0 1.0 4.0 4.0 4.0 6.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
891129 3.0 2 5.0 1 5 3 5 4 5 1 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 3.0 3.0
891130 3.0 1 5.0 5 3 4 3 4 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891135 3.0 2 2.0 3 2 4 2 3 2 2 ... 1.0 3.0 3.0 2.0 5.0 4.0 2.0 4.0 8.0 5.0
891136 1.0 2 6.0 3 4 3 5 4 2 4 ... 3.0 3.0 0.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
891137 4.0 2 4.0 4 1 5 2 3 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891139 3.0 2 5.0 1 5 2 5 2 5 1 ... 1.0 3.0 1.0 2.0 5.0 4.0 1.0 4.0 7.0 3.0
891140 4.0 2 1.0 4 1 5 2 3 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891150 4.0 2 3.0 2 2 5 2 1 5 6 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 3.0 3.0
891154 4.0 1 1.0 5 1 5 1 2 3 5 ... NaN NaN NaN NaN NaN NaN NaN 2.0 4.0 3.0
891155 3.0 2 6.0 5 3 2 3 5 1 3 ... 3.0 2.0 1.0 0.0 1.0 4.0 5.0 1.0 1.0 1.0
891157 1.0 1 2.0 5 1 5 1 2 2 5 ... 3.0 3.0 1.0 1.0 1.0 5.0 4.0 4.0 7.0 5.0
891159 2.0 1 1.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891161 2.0 2 3.0 1 4 2 5 4 3 1 ... 2.0 3.0 2.0 1.0 1.0 5.0 5.0 4.0 5.0 4.0
891162 4.0 2 4.0 4 1 5 1 2 5 2 ... 1.0 4.0 2.0 1.0 2.0 4.0 3.0 3.0 4.0 4.0
891164 3.0 2 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891169 3.0 1 1.0 5 2 4 2 3 1 3 ... NaN NaN NaN NaN NaN NaN NaN 1.0 2.0 1.0
891170 3.0 1 4.0 4 5 2 4 5 1 3 ... NaN NaN NaN NaN NaN NaN NaN 3.0 4.0 1.0
891171 3.0 2 5.0 4 2 3 5 5 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891172 1.0 2 1.0 1 4 3 5 5 2 1 ... NaN NaN NaN NaN NaN NaN NaN 3.0 4.0 4.0
891173 4.0 1 1.0 4 1 5 1 2 2 5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891175 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891181 2.0 2 5.0 2 3 4 4 3 3 1 ... 2.0 3.0 2.0 2.0 4.0 5.0 3.0 3.0 4.0 4.0
891184 3.0 2 6.0 1 5 3 5 4 5 1 ... 1.0 4.0 3.0 1.0 3.0 3.0 2.0 4.0 8.0 3.0
891185 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891187 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891203 4.0 2 1.0 4 1 5 1 3 4 5 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 4.0 8.0 5.0
891208 4.0 1 2.0 5 2 4 2 3 1 6 ... 3.0 2.0 0.0 0.0 1.0 3.0 3.0 4.0 2.0 5.0
891216 3.0 2 5.0 1 4 2 5 4 4 1 ... 3.0 2.0 0.0 0.0 1.0 2.0 3.0 NaN NaN NaN
891217 2.0 1 4.0 3 3 3 2 2 3 6 ... 1.0 3.0 3.0 2.0 4.0 5.0 3.0 4.0 6.0 5.0
891218 2.0 2 4.0 2 4 2 5 4 3 1 ... 4.0 2.0 0.0 0.0 1.0 3.0 4.0 2.0 2.0 3.0

268012 rows × 79 columns

In [24]:
# Write code to divide the data into two subsets based on the number of missing
# values in each row.

# Now that I have data with NaN values, I have total of 85 columns
# From the visualization above, I decided thresh=20 is good enough.
NaN_appeared = NaN_appeared.dropna(thresh=20)

NaN_appeared
Out[24]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2.0 1 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4.0 2 2.0 4 2 5 2 1 2 6 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
11 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 3.0 1 6.0 5 3 4 2 4 1 3 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 3.0 6.0 4.0
13 1.0 2 5.0 1 4 3 5 5 2 1 ... 2.0 1.0 1.0 1.0 1.0 3.0 3.0 3.0 6.0 4.0
14 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 4.0 2 4.0 4 1 5 1 1 4 2 ... NaN NaN NaN NaN NaN NaN NaN 4.0 8.0 5.0
16 1.0 2 1.0 4 3 1 4 5 1 3 ... 3.0 3.0 1.0 0.0 1.0 3.0 4.0 1.0 2.0 1.0
17 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 2.0 2 6.0 2 4 1 5 4 1 1 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 4.0 3.0
20 2.0 2 4.0 4 3 1 4 5 1 3 ... NaN NaN NaN NaN NaN NaN NaN 3.0 4.0 1.0
23 3.0 1 3.0 5 3 3 2 2 1 6 ... NaN NaN NaN NaN NaN NaN NaN 3.0 6.0 2.0
24 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 1.0 1 3.0 3 5 3 5 4 3 4 ... 2.0 2.0 0.0 0.0 1.0 4.0 5.0 3.0 2.0 5.0
26 3.0 1 3.0 5 2 4 2 3 1 3 ... NaN NaN NaN NaN NaN NaN NaN 4.0 3.0 5.0
29 4.0 2 1.0 5 1 5 2 1 3 6 ... 3.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 4.0 3.0
30 3.0 2 3.0 4 3 4 4 4 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
34 1.0 2 2.0 4 1 5 1 2 3 2 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 4.0 7.0 5.0
35 2.0 2 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
39 4.0 1 4.0 5 3 4 3 1 1 6 ... 2.0 3.0 2.0 1.0 1.0 4.0 4.0 3.0 5.0 3.0
40 4.0 2 1.0 3 2 5 2 1 5 6 ... 3.0 2.0 1.0 0.0 1.0 2.0 3.0 3.0 5.0 3.0
42 1.0 2 5.0 3 4 2 5 4 2 3 ... 4.0 3.0 1.0 0.0 1.0 2.0 3.0 3.0 3.0 1.0
46 NaN 2 3.0 2 4 3 5 5 4 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
48 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
53 2.0 1 3.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
54 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
60 2.0 1 6.0 3 2 5 1 1 5 5 ... 1.0 4.0 3.0 2.0 4.0 4.0 2.0 3.0 7.0 2.0
61 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62 3.0 1 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
64 2.0 2 6.0 1 4 2 4 4 4 1 ... 3.0 3.0 2.0 0.0 1.0 4.0 4.0 4.0 6.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
891129 3.0 2 5.0 1 5 3 5 4 5 1 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 3.0 3.0
891130 3.0 1 5.0 5 3 4 3 4 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891135 3.0 2 2.0 3 2 4 2 3 2 2 ... 1.0 3.0 3.0 2.0 5.0 4.0 2.0 4.0 8.0 5.0
891136 1.0 2 6.0 3 4 3 5 4 2 4 ... 3.0 3.0 0.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
891137 4.0 2 4.0 4 1 5 2 3 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891139 3.0 2 5.0 1 5 2 5 2 5 1 ... 1.0 3.0 1.0 2.0 5.0 4.0 1.0 4.0 7.0 3.0
891140 4.0 2 1.0 4 1 5 2 3 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891150 4.0 2 3.0 2 2 5 2 1 5 6 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 3.0 3.0
891154 4.0 1 1.0 5 1 5 1 2 3 5 ... NaN NaN NaN NaN NaN NaN NaN 2.0 4.0 3.0
891155 3.0 2 6.0 5 3 2 3 5 1 3 ... 3.0 2.0 1.0 0.0 1.0 4.0 5.0 1.0 1.0 1.0
891157 1.0 1 2.0 5 1 5 1 2 2 5 ... 3.0 3.0 1.0 1.0 1.0 5.0 4.0 4.0 7.0 5.0
891159 2.0 1 1.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891161 2.0 2 3.0 1 4 2 5 4 3 1 ... 2.0 3.0 2.0 1.0 1.0 5.0 5.0 4.0 5.0 4.0
891162 4.0 2 4.0 4 1 5 1 2 5 2 ... 1.0 4.0 2.0 1.0 2.0 4.0 3.0 3.0 4.0 4.0
891164 3.0 2 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891169 3.0 1 1.0 5 2 4 2 3 1 3 ... NaN NaN NaN NaN NaN NaN NaN 1.0 2.0 1.0
891170 3.0 1 4.0 4 5 2 4 5 1 3 ... NaN NaN NaN NaN NaN NaN NaN 3.0 4.0 1.0
891171 3.0 2 5.0 4 2 3 5 5 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891172 1.0 2 1.0 1 4 3 5 5 2 1 ... NaN NaN NaN NaN NaN NaN NaN 3.0 4.0 4.0
891173 4.0 1 1.0 4 1 5 1 2 2 5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891175 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891181 2.0 2 5.0 2 3 4 4 3 3 1 ... 2.0 3.0 2.0 2.0 4.0 5.0 3.0 3.0 4.0 4.0
891184 3.0 2 6.0 1 5 3 5 4 5 1 ... 1.0 4.0 3.0 1.0 3.0 3.0 2.0 4.0 8.0 3.0
891185 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891187 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
891203 4.0 2 1.0 4 1 5 1 3 4 5 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 4.0 8.0 5.0
891208 4.0 1 2.0 5 2 4 2 3 1 6 ... 3.0 2.0 0.0 0.0 1.0 3.0 3.0 4.0 2.0 5.0
891216 3.0 2 5.0 1 4 2 5 4 4 1 ... 3.0 2.0 0.0 0.0 1.0 2.0 3.0 NaN NaN NaN
891217 2.0 1 4.0 3 3 3 2 2 3 6 ... 1.0 3.0 3.0 2.0 4.0 5.0 3.0 4.0 6.0 5.0
891218 2.0 2 4.0 2 4 2 5 4 3 1 ... 4.0 2.0 0.0 0.0 1.0 3.0 4.0 2.0 2.0 3.0

268012 rows × 79 columns

In [25]:
#Here's a column example, NaN values are reduced significantly
plt.hist(NaN_appeared["ORTSGR_KLS9"].astype(str))
Out[25]:
(array([ 97274.,  10609.,  17540.,  21765.,  27428.,  31909.,  14273.,
         18826.,  12389.,  15999.]),
 array([-1. , -0.1,  0.8,  1.7,  2.6,  3.5,  4.4,  5.3,  6.2,  7.1,  8. ]),
 <a list of 10 Patch objects>)
In [26]:
# And here, data that has no NaN values, completely clean data
NaN_missing
Out[26]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
1 1.0 2 5.0 1 5 2 5 4 5 1 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 3.0 2 3.0 1 4 1 2 3 5 1 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
4 3.0 1 5.0 4 3 4 1 3 2 5 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0
5 1.0 2 2.0 3 1 5 2 2 5 2 ... 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0
6 2.0 2 5.0 1 5 1 5 4 3 4 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 4.0 6.0 3.0
7 1.0 1 3.0 3 3 4 1 3 2 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 2.0 5.0 2.0
8 3.0 1 3.0 4 4 2 4 2 2 6 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 4.0 3.0
9 3.0 2 4.0 2 4 2 3 5 4 1 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 3.0 1.0
10 3.0 2 1.0 2 2 5 3 1 5 6 ... 2.0 4.0 2.0 0.0 2.0 3.0 3.0 4.0 6.0 5.0
19 3.0 1 3.0 5 2 3 1 3 1 5 ... 2.0 4.0 2.0 1.0 2.0 5.0 4.0 4.0 6.0 3.0
21 2.0 1 3.0 3 4 1 2 5 1 3 ... 1.0 4.0 3.0 2.0 4.0 4.0 3.0 5.0 7.0 5.0
22 1.0 1 4.0 1 5 3 5 5 5 1 ... 2.0 4.0 2.0 1.0 2.0 4.0 3.0 4.0 5.0 5.0
27 3.0 1 4.0 3 3 4 1 2 2 5 ... 2.0 3.0 3.0 2.0 4.0 5.0 3.0 4.0 7.0 5.0
28 3.0 1 2.0 3 2 4 3 3 2 3 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 3.0 4.0 1.0
31 2.0 2 6.0 1 5 2 3 5 4 1 ... 1.0 3.0 3.0 2.0 5.0 4.0 2.0 4.0 8.0 5.0
32 1.0 1 4.0 5 4 2 3 4 1 3 ... 2.0 3.0 1.0 0.0 1.0 3.0 4.0 4.0 3.0 5.0
33 2.0 2 4.0 2 2 4 1 3 4 5 ... 1.0 3.0 2.0 1.0 5.0 5.0 3.0 3.0 8.0 5.0
36 3.0 2 2.0 2 3 4 3 2 3 6 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 4.0 9.0 4.0
37 4.0 1 2.0 5 1 5 1 1 3 6 ... 4.0 2.0 1.0 0.0 1.0 3.0 4.0 3.0 5.0 2.0
38 2.0 2 6.0 3 3 1 4 4 2 4 ... 4.0 2.0 0.0 0.0 1.0 3.0 4.0 3.0 5.0 2.0
41 1.0 2 3.0 2 5 3 5 5 2 1 ... 3.0 2.0 1.0 0.0 1.0 4.0 5.0 3.0 5.0 2.0
43 1.0 2 4.0 1 5 3 5 5 5 1 ... 2.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 7.0 2.0
44 3.0 2 4.0 3 2 4 4 2 2 6 ... 3.0 2.0 1.0 0.0 1.0 2.0 3.0 1.0 1.0 1.0
45 4.0 2 2.0 5 1 5 2 1 2 2 ... 3.0 2.0 1.0 0.0 1.0 5.0 5.0 3.0 2.0 3.0
47 4.0 2 2.0 4 1 5 1 2 3 5 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 5.0 3.0
49 2.0 1 6.0 4 5 1 4 4 2 4 ... 3.0 2.0 1.0 0.0 1.0 4.0 5.0 3.0 4.0 2.0
50 3.0 1 6.0 2 5 2 4 5 2 1 ... 2.0 3.0 0.0 1.0 1.0 3.0 4.0 2.0 1.0 1.0
51 1.0 1 5.0 2 5 3 5 5 2 1 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 3.0 3.0
52 1.0 2 1.0 1 4 2 2 4 3 1 ... 1.0 4.0 3.0 2.0 5.0 5.0 3.0 4.0 9.0 3.0
55 3.0 2 5.0 2 3 4 2 2 3 1 ... 1.0 4.0 3.0 1.0 3.0 5.0 4.0 3.0 8.0 5.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
891183 3.0 1 3.0 5 1 5 1 1 2 5 ... 4.0 3.0 1.0 0.0 1.0 3.0 3.0 4.0 8.0 4.0
891186 1.0 1 5.0 4 3 1 1 5 1 3 ... 2.0 3.0 1.0 0.0 1.0 3.0 3.0 4.0 8.0 3.0
891188 1.0 2 3.0 4 3 1 4 5 2 4 ... 3.0 3.0 1.0 0.0 1.0 2.0 2.0 2.0 1.0 1.0
891189 2.0 2 3.0 3 2 4 1 3 2 5 ... 4.0 2.0 1.0 0.0 1.0 3.0 4.0 5.0 8.0 5.0
891190 3.0 1 1.0 5 1 5 1 2 2 2 ... 3.0 2.0 2.0 1.0 1.0 4.0 4.0 2.0 5.0 3.0
891191 4.0 2 1.0 4 1 5 1 1 4 2 ... 3.0 3.0 1.0 0.0 1.0 4.0 5.0 1.0 3.0 1.0
891192 1.0 2 3.0 1 5 2 5 3 5 1 ... 4.0 2.0 1.0 0.0 1.0 3.0 3.0 3.0 5.0 1.0
891193 4.0 1 3.0 4 1 5 1 2 5 5 ... 2.0 4.0 1.0 0.0 2.0 2.0 2.0 4.0 6.0 4.0
891194 3.0 1 4.0 4 3 4 2 3 1 3 ... 3.0 2.0 2.0 0.0 1.0 4.0 4.0 4.0 8.0 5.0
891195 4.0 2 6.0 3 1 5 1 1 5 2 ... 2.0 3.0 2.0 1.0 1.0 4.0 3.0 4.0 7.0 5.0
891196 2.0 2 6.0 1 5 2 5 4 3 1 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 4.0 5.0
891197 3.0 2 1.0 3 2 4 1 3 3 5 ... 2.0 3.0 2.0 1.0 1.0 4.0 4.0 3.0 9.0 5.0
891198 3.0 1 5.0 2 5 3 4 5 3 1 ... 4.0 2.0 1.0 0.0 1.0 4.0 5.0 3.0 9.0 5.0
891199 2.0 1 3.0 2 5 2 3 3 4 1 ... 1.0 4.0 3.0 2.0 5.0 5.0 3.0 3.0 7.0 3.0
891200 1.0 2 3.0 1 5 3 5 5 5 1 ... 1.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0
891201 3.0 1 3.0 4 2 3 2 2 3 6 ... 1.0 4.0 3.0 1.0 3.0 4.0 3.0 3.0 7.0 5.0
891202 2.0 2 5.0 1 4 1 2 5 4 1 ... 1.0 4.0 3.0 2.0 5.0 4.0 2.0 4.0 8.0 5.0
891204 3.0 1 5.0 4 3 4 4 2 1 6 ... 2.0 3.0 1.0 0.0 1.0 3.0 4.0 4.0 7.0 3.0
891205 4.0 1 2.0 4 2 4 3 3 4 2 ... 1.0 4.0 3.0 1.0 3.0 3.0 2.0 3.0 7.0 5.0
891206 1.0 2 4.0 3 4 2 4 4 2 4 ... 3.0 2.0 1.0 0.0 1.0 4.0 5.0 3.0 4.0 3.0
891207 3.0 2 1.0 5 1 5 1 2 4 2 ... 2.0 3.0 2.0 2.0 4.0 4.0 2.0 1.0 5.0 1.0
891209 1.0 2 5.0 1 4 2 3 4 4 1 ... 1.0 4.0 2.0 1.0 5.0 5.0 4.0 4.0 5.0 5.0
891210 3.0 1 5.0 3 3 4 2 2 2 6 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 9.0 5.0
891211 3.0 1 2.0 3 2 4 3 3 2 2 ... 1.0 4.0 2.0 0.0 2.0 3.0 2.0 3.0 5.0 5.0
891212 4.0 1 1.0 3 1 5 1 1 5 5 ... 1.0 3.0 3.0 2.0 5.0 4.0 1.0 3.0 9.0 5.0
891213 4.0 2 5.0 3 3 3 5 3 2 6 ... 2.0 4.0 2.0 1.0 2.0 5.0 5.0 3.0 4.0 4.0
891214 1.0 2 4.0 1 5 2 3 3 4 1 ... 1.0 4.0 3.0 2.0 5.0 5.0 2.0 3.0 7.0 3.0
891215 2.0 2 6.0 1 5 2 4 5 4 1 ... 2.0 4.0 2.0 1.0 2.0 4.0 4.0 2.0 5.0 2.0
891219 1.0 1 3.0 1 5 3 5 5 5 1 ... 1.0 4.0 3.0 1.0 5.0 1.0 1.0 4.0 7.0 5.0
891220 4.0 1 1.0 4 2 5 2 1 5 6 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 4.0 5.0

623209 rows × 79 columns

In [27]:
# Same column in the Nan_missing frame, 0 NaN
plt.hist(NaN_missing["ORTSGR_KLS9"].astype(str))
Out[27]:
(array([  29980.,   45822.,   61777.,   87481.,       0.,  116187.,
          61722.,   84040.,   60320.,   75880.]),
 array([ 0. ,  0.8,  1.6,  2.4,  3.2,  4. ,  4.8,  5.6,  6.4,  7.2,  8. ]),
 <a list of 10 Patch objects>)
In [28]:
print(f"After cleaning data, NaN appeared at least once in {len(NaN_appeared)} rows, and didn't appear at all in {len(NaN_missing)} rows \
that's total of {len(NaN_appeared) + len(NaN_missing)} rows")
After cleaning data, NaN appeared at least once in 268012 rows, and didn't appear at all in 623209 rows that's total of 891221 rows
In [29]:
# Compare the distribution of values for at least five columns where there are
# no or few missing values, between the two subsets.

data = [NaN_missing, NaN_appeared]
NaN_missing.name = "NaN_missing"
NaN_appeared.name = "NaN_appeared"


for d in data:
    print(f"5 columns in the {d.name} DataFrame\n")
    print("ORTSGR_KLS9")
    sns.countplot(data = d, x = "ORTSGR_KLS9")
    plt.show()
    print("FINANZ_HAUSBAUER")
    sns.countplot(data = d, x = "FINANZ_HAUSBAUER")
    plt.show()
    print("PLZ8_ANTG3")
    sns.countplot(data = d, x = "PLZ8_ANTG3")
    plt.show()
    print("PLZ8_GBZ")
    sns.countplot(data = d, x = "PLZ8_GBZ")
    plt.show()
    print("RELAT_AB")
    sns.countplot(data = d, x = "RELAT_AB")
    plt.show()
5 columns in the NaN_missing DataFrame

ORTSGR_KLS9
FINANZ_HAUSBAUER
PLZ8_ANTG3
PLZ8_GBZ
RELAT_AB
5 columns in the NaN_appeared DataFrame

ORTSGR_KLS9
FINANZ_HAUSBAUER
PLZ8_ANTG3
PLZ8_GBZ
RELAT_AB
In [30]:
frames = [NaN_appeared, NaN_missing]
new_azdias = pd.concat(frames)
new_azdias
Out[30]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2.0 1 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4.0 2 2.0 4 2 5 2 1 2 6 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
11 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 3.0 1 6.0 5 3 4 2 4 1 3 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 3.0 6.0 4.0
13 1.0 2 5.0 1 4 3 5 5 2 1 ... 2.0 1.0 1.0 1.0 1.0 3.0 3.0 3.0 6.0 4.0
14 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 4.0 2 4.0 4 1 5 1 1 4 2 ... NaN NaN NaN NaN NaN NaN NaN 4.0 8.0 5.0
16 1.0 2 1.0 4 3 1 4 5 1 3 ... 3.0 3.0 1.0 0.0 1.0 3.0 4.0 1.0 2.0 1.0
17 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 2.0 2 6.0 2 4 1 5 4 1 1 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 4.0 3.0
20 2.0 2 4.0 4 3 1 4 5 1 3 ... NaN NaN NaN NaN NaN NaN NaN 3.0 4.0 1.0
23 3.0 1 3.0 5 3 3 2 2 1 6 ... NaN NaN NaN NaN NaN NaN NaN 3.0 6.0 2.0
24 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 1.0 1 3.0 3 5 3 5 4 3 4 ... 2.0 2.0 0.0 0.0 1.0 4.0 5.0 3.0 2.0 5.0
26 3.0 1 3.0 5 2 4 2 3 1 3 ... NaN NaN NaN NaN NaN NaN NaN 4.0 3.0 5.0
29 4.0 2 1.0 5 1 5 2 1 3 6 ... 3.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 4.0 3.0
30 3.0 2 3.0 4 3 4 4 4 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
34 1.0 2 2.0 4 1 5 1 2 3 2 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 4.0 7.0 5.0
35 2.0 2 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
39 4.0 1 4.0 5 3 4 3 1 1 6 ... 2.0 3.0 2.0 1.0 1.0 4.0 4.0 3.0 5.0 3.0
40 4.0 2 1.0 3 2 5 2 1 5 6 ... 3.0 2.0 1.0 0.0 1.0 2.0 3.0 3.0 5.0 3.0
42 1.0 2 5.0 3 4 2 5 4 2 3 ... 4.0 3.0 1.0 0.0 1.0 2.0 3.0 3.0 3.0 1.0
46 NaN 2 3.0 2 4 3 5 5 4 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
48 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
53 2.0 1 3.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
54 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
60 2.0 1 6.0 3 2 5 1 1 5 5 ... 1.0 4.0 3.0 2.0 4.0 4.0 2.0 3.0 7.0 2.0
61 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62 3.0 1 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
64 2.0 2 6.0 1 4 2 4 4 4 1 ... 3.0 3.0 2.0 0.0 1.0 4.0 4.0 4.0 6.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
891183 3.0 1 3.0 5 1 5 1 1 2 5 ... 4.0 3.0 1.0 0.0 1.0 3.0 3.0 4.0 8.0 4.0
891186 1.0 1 5.0 4 3 1 1 5 1 3 ... 2.0 3.0 1.0 0.0 1.0 3.0 3.0 4.0 8.0 3.0
891188 1.0 2 3.0 4 3 1 4 5 2 4 ... 3.0 3.0 1.0 0.0 1.0 2.0 2.0 2.0 1.0 1.0
891189 2.0 2 3.0 3 2 4 1 3 2 5 ... 4.0 2.0 1.0 0.0 1.0 3.0 4.0 5.0 8.0 5.0
891190 3.0 1 1.0 5 1 5 1 2 2 2 ... 3.0 2.0 2.0 1.0 1.0 4.0 4.0 2.0 5.0 3.0
891191 4.0 2 1.0 4 1 5 1 1 4 2 ... 3.0 3.0 1.0 0.0 1.0 4.0 5.0 1.0 3.0 1.0
891192 1.0 2 3.0 1 5 2 5 3 5 1 ... 4.0 2.0 1.0 0.0 1.0 3.0 3.0 3.0 5.0 1.0
891193 4.0 1 3.0 4 1 5 1 2 5 5 ... 2.0 4.0 1.0 0.0 2.0 2.0 2.0 4.0 6.0 4.0
891194 3.0 1 4.0 4 3 4 2 3 1 3 ... 3.0 2.0 2.0 0.0 1.0 4.0 4.0 4.0 8.0 5.0
891195 4.0 2 6.0 3 1 5 1 1 5 2 ... 2.0 3.0 2.0 1.0 1.0 4.0 3.0 4.0 7.0 5.0
891196 2.0 2 6.0 1 5 2 5 4 3 1 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 4.0 5.0
891197 3.0 2 1.0 3 2 4 1 3 3 5 ... 2.0 3.0 2.0 1.0 1.0 4.0 4.0 3.0 9.0 5.0
891198 3.0 1 5.0 2 5 3 4 5 3 1 ... 4.0 2.0 1.0 0.0 1.0 4.0 5.0 3.0 9.0 5.0
891199 2.0 1 3.0 2 5 2 3 3 4 1 ... 1.0 4.0 3.0 2.0 5.0 5.0 3.0 3.0 7.0 3.0
891200 1.0 2 3.0 1 5 3 5 5 5 1 ... 1.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0
891201 3.0 1 3.0 4 2 3 2 2 3 6 ... 1.0 4.0 3.0 1.0 3.0 4.0 3.0 3.0 7.0 5.0
891202 2.0 2 5.0 1 4 1 2 5 4 1 ... 1.0 4.0 3.0 2.0 5.0 4.0 2.0 4.0 8.0 5.0
891204 3.0 1 5.0 4 3 4 4 2 1 6 ... 2.0 3.0 1.0 0.0 1.0 3.0 4.0 4.0 7.0 3.0
891205 4.0 1 2.0 4 2 4 3 3 4 2 ... 1.0 4.0 3.0 1.0 3.0 3.0 2.0 3.0 7.0 5.0
891206 1.0 2 4.0 3 4 2 4 4 2 4 ... 3.0 2.0 1.0 0.0 1.0 4.0 5.0 3.0 4.0 3.0
891207 3.0 2 1.0 5 1 5 1 2 4 2 ... 2.0 3.0 2.0 2.0 4.0 4.0 2.0 1.0 5.0 1.0
891209 1.0 2 5.0 1 4 2 3 4 4 1 ... 1.0 4.0 2.0 1.0 5.0 5.0 4.0 4.0 5.0 5.0
891210 3.0 1 5.0 3 3 4 2 2 2 6 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 9.0 5.0
891211 3.0 1 2.0 3 2 4 3 3 2 2 ... 1.0 4.0 2.0 0.0 2.0 3.0 2.0 3.0 5.0 5.0
891212 4.0 1 1.0 3 1 5 1 1 5 5 ... 1.0 3.0 3.0 2.0 5.0 4.0 1.0 3.0 9.0 5.0
891213 4.0 2 5.0 3 3 3 5 3 2 6 ... 2.0 4.0 2.0 1.0 2.0 5.0 5.0 3.0 4.0 4.0
891214 1.0 2 4.0 1 5 2 3 3 4 1 ... 1.0 4.0 3.0 2.0 5.0 5.0 2.0 3.0 7.0 3.0
891215 2.0 2 6.0 1 5 2 4 5 4 1 ... 2.0 4.0 2.0 1.0 2.0 4.0 4.0 2.0 5.0 2.0
891219 1.0 1 3.0 1 5 3 5 5 5 1 ... 1.0 4.0 3.0 1.0 5.0 1.0 1.0 4.0 7.0 5.0
891220 4.0 1 1.0 4 2 5 2 1 5 6 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 4.0 5.0

891221 rows × 79 columns

Discussion 1.1.3: Assess Missing Data in Each Row

(Double-click this cell and replace this text with your own text, reporting your observations regarding missing data in rows. Are the data with lots of missing values are qualitatively different from data with few or no missing values?)

ANSWER

Rows with a lot of missing values are significant, and needs to be dealt with accordingly

Step 1.2: Select and Re-Encode Features

Checking for missing data isn't the only way in which you can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, you need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.

  • For numeric and interval data, these features can be kept without changes.
  • Most of the variables in the dataset are ordinal in nature. While ordinal values may technically be non-linear in spacing, make the simplifying assumption that the ordinal variables can be treated as being interval in nature (that is, kept without any changes).
  • Special handling may be necessary for the remaining two variable types: categorical, and 'mixed'.

In the first two parts of this sub-step, you will perform an investigation of the categorical and mixed-type features and make a decision on each of them, whether you will keep, drop, or re-encode each. Then, in the last part, you will create a new data frame with only the selected and engineered columns.

Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. But stick with it: once you're done with this step, you'll be ready to get to the machine learning parts of the project!

In [31]:
# How many features are there of each data type?

data_type = feat_info["type"].value_counts()
data_type
Out[31]:
ordinal        49
categorical    21
numeric         7
mixed           7
interval        1
Name: type, dtype: int64

Step 1.2.1: Re-Encode Categorical Features

For categorical data, you would ordinarily need to encode the levels as dummy variables. Depending on the number of categories, perform one of the following:

  • For binary (two-level) categoricals that take numeric values, you can keep them without needing to do anything.
  • There is one binary variable that takes on non-numeric values. For this one, you need to re-encode the values as numbers or create a dummy variable.
  • For multi-level categoricals (three or more values), you can choose to encode the values using multiple dummy variables (e.g. via OneHotEncoder), or (to keep things straightforward) just drop them from the analysis. As always, document your choices in the Discussion section.
In [32]:
# Assess categorical variables: which are binary, which are multi-level, and
# which one needs to be re-encoded?

# Dropped columns that were already dropped before
categorical_columns = (feat_info.loc[feat_info['type'] == "categorical"]["attribute"]).drop([39,40])
categorical_columns
Out[32]:
0             AGER_TYP
2            ANREDE_KZ
3        CJT_GESAMTTYP
10           FINANZTYP
12     GFK_URLAUBERTYP
13    GREEN_AVANTGARDE
17     LP_FAMILIE_FEIN
18     LP_FAMILIE_GROB
19      LP_STATUS_FEIN
20      LP_STATUS_GROB
21    NATIONALITAET_KZ
38         SHOPPER_TYP
41            VERS_TYP
42            ZABEOTYP
47        KK_KUNDENTYP
52         GEBAEUDETYP
55         OST_WEST_KZ
57     CAMEO_DEUG_2015
58      CAMEO_DEU_2015
Name: attribute, dtype: object
In [33]:
# These are the categorical columns, 0 and 47 were removed before so I dropped them here
new_azdias[categorical_columns.drop([0, 47])]
Out[33]:
ANREDE_KZ CJT_GESAMTTYP FINANZTYP GFK_URLAUBERTYP GREEN_AVANTGARDE LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ SHOPPER_TYP VERS_TYP ZABEOTYP GEBAEUDETYP OST_WEST_KZ CAMEO_DEUG_2015 CAMEO_DEU_2015
0 1 2.0 4 10.0 0 2.0 2.0 1.0 1.0 NaN NaN NaN 3 NaN NaN NaN NaN
3 2 2.0 6 1.0 0 NaN NaN 9.0 4.0 1.0 1.0 1.0 3 1.0 W 2 2A
11 1 6.0 4 5.0 0 NaN NaN 5.0 2.0 NaN NaN NaN 3 NaN NaN NaN NaN
12 1 6.0 3 10.0 0 5.0 3.0 9.0 4.0 1.0 1.0 1.0 1 1.0 W 6 6B
13 2 5.0 1 12.0 1 1.0 1.0 5.0 2.0 1.0 3.0 2.0 5 1.0 W 5 5C
14 1 6.0 4 5.0 0 NaN NaN 5.0 2.0 NaN NaN NaN 3 NaN NaN NaN NaN
15 2 4.0 2 12.0 0 1.0 1.0 8.0 4.0 1.0 1.0 2.0 3 8.0 W 8 8B
16 2 1.0 3 10.0 0 NaN NaN 8.0 4.0 3.0 2.0 2.0 5 2.0 W 7 7A
17 1 6.0 4 5.0 0 NaN NaN 5.0 2.0 NaN NaN NaN 3 NaN NaN NaN NaN
18 2 6.0 1 11.0 0 1.0 1.0 8.0 4.0 2.0 1.0 1.0 4 1.0 W 4 4C
20 2 4.0 3 11.0 1 1.0 1.0 10.0 5.0 1.0 1.0 1.0 3 3.0 W 4 4C
23 1 3.0 6 12.0 1 2.0 2.0 10.0 5.0 2.0 0.0 2.0 4 1.0 W 6 6B
24 2 6.0 4 5.0 0 NaN NaN 5.0 2.0 NaN NaN NaN 3 NaN NaN NaN NaN
25 1 3.0 4 1.0 0 NaN NaN 8.0 4.0 1.0 0.0 2.0 5 1.0 W 5 5C
26 1 3.0 3 2.0 0 1.0 1.0 8.0 4.0 1.0 1.0 1.0 1 1.0 W NaN NaN
29 2 1.0 6 5.0 1 1.0 1.0 10.0 5.0 1.0 1.0 2.0 1 1.0 W 2 2A
30 2 3.0 3 11.0 0 1.0 1.0 4.0 2.0 1.0 3.0 2.0 3 NaN NaN NaN NaN
34 2 2.0 2 5.0 1 1.0 1.0 10.0 5.0 NaN NaN NaN 1 1.0 W 3 3D
35 2 2.0 4 7.0 0 2.0 2.0 9.0 4.0 NaN NaN NaN 3 NaN NaN NaN NaN
39 1 4.0 6 10.0 1 11.0 5.0 10.0 5.0 1.0 0.0 1.0 1 1.0 W NaN NaN
40 2 1.0 6 4.0 0 1.0 1.0 4.0 2.0 1.0 1.0 1.0 3 3.0 W NaN NaN
42 2 5.0 3 10.0 0 1.0 1.0 8.0 4.0 1.0 0.0 2.0 1 1.0 W 4 4B
46 2 3.0 1 1.0 0 1.0 1.0 2.0 1.0 1.0 NaN NaN 3 NaN NaN NaN NaN
48 2 6.0 4 5.0 0 NaN NaN 5.0 2.0 NaN NaN NaN 3 NaN NaN NaN NaN
53 1 3.0 4 9.0 0 10.0 5.0 6.0 3.0 NaN NaN NaN 3 NaN NaN NaN NaN
54 1 6.0 4 5.0 0 NaN NaN 5.0 2.0 NaN NaN NaN 3 NaN NaN NaN NaN
60 1 6.0 5 3.0 0 1.0 1.0 3.0 2.0 NaN NaN NaN 6 1.0 W 8 8A
61 2 6.0 4 1.0 0 3.0 3.0 4.0 2.0 NaN NaN NaN 3 NaN NaN NaN NaN
62 1 4.0 4 9.0 0 1.0 1.0 2.0 1.0 NaN NaN NaN 3 NaN NaN NaN NaN
64 2 6.0 1 2.0 0 NaN NaN 2.0 1.0 1.0 2.0 1.0 5 1.0 W 9 9A
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
891183 1 3.0 5 8.0 1 11.0 5.0 10.0 5.0 1.0 1.0 2.0 1 1.0 W 5 5E
891186 1 5.0 3 7.0 1 9.0 5.0 10.0 5.0 1.0 1.0 1.0 1 1.0 W 6 6E
891188 2 3.0 4 1.0 0 1.0 1.0 9.0 4.0 1.0 2.0 2.0 4 8.0 W 3 3C
891189 2 3.0 5 12.0 1 1.0 1.0 3.0 2.0 1.0 3.0 1.0 4 1.0 W 3 3D
891190 1 1.0 2 3.0 1 1.0 1.0 10.0 5.0 1.0 0.0 2.0 1 1.0 W 1 1D
891191 2 1.0 2 6.0 0 1.0 1.0 4.0 2.0 1.0 3.0 2.0 3 1.0 W 6 6C
891192 2 3.0 1 10.0 0 10.0 5.0 2.0 1.0 1.0 0.0 1.0 4 3.0 W 5 5B
891193 1 3.0 5 4.0 0 2.0 2.0 3.0 2.0 1.0 2.0 1.0 4 1.0 W 6 6B
891194 1 4.0 3 9.0 1 7.0 4.0 10.0 5.0 1.0 0.0 2.0 4 1.0 W 4 4C
891195 2 6.0 2 5.0 0 10.0 5.0 3.0 2.0 1.0 3.0 2.0 3 1.0 W 8 8D
891196 2 6.0 1 12.0 0 7.0 4.0 4.0 2.0 2.0 1.0 1.0 4 8.0 W 4 4B
891197 2 1.0 5 5.0 1 1.0 1.0 10.0 5.0 1.0 2.0 2.0 4 3.0 W 3 3C
891198 1 5.0 1 10.0 1 1.0 1.0 5.0 2.0 1.0 1.0 2.0 4 1.0 W 2 2C
891199 1 3.0 1 11.0 0 1.0 1.0 1.0 1.0 2.0 2.0 1.0 4 8.0 W 8 8A
891200 2 3.0 1 12.0 0 10.0 5.0 2.0 1.0 2.0 3.0 2.0 4 1.0 W 9 9A
891201 1 3.0 6 7.0 0 1.0 1.0 3.0 2.0 1.0 2.0 2.0 4 3.0 W 7 7A
891202 2 5.0 1 10.0 0 1.0 1.0 1.0 1.0 1.0 3.0 1.0 4 3.0 W 8 8B
891204 1 5.0 6 11.0 0 11.0 5.0 9.0 4.0 1.0 0.0 2.0 1 3.0 W 3 3C
891205 1 2.0 2 5.0 0 10.0 5.0 1.0 1.0 1.0 2.0 1.0 3 1.0 W 5 5E
891206 2 4.0 4 12.0 0 1.0 1.0 9.0 4.0 1.0 0.0 2.0 5 1.0 W 3 3D
891207 2 1.0 2 8.0 1 2.0 2.0 10.0 5.0 1.0 3.0 1.0 3 1.0 W 7 7A
891209 2 5.0 1 9.0 0 10.0 5.0 1.0 1.0 3.0 2.0 2.0 5 1.0 W 8 8A
891210 1 5.0 6 4.0 0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 3 8.0 W 8 8B
891211 1 2.0 2 1.0 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 6 1.0 W 9 9D
891212 1 1.0 5 4.0 0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 3 3.0 W 9 9D
891213 2 5.0 6 8.0 1 11.0 5.0 6.0 3.0 1.0 1.0 1.0 4 3.0 W 5 5E
891214 2 4.0 1 10.0 0 1.0 1.0 1.0 1.0 1.0 3.0 2.0 5 1.0 W 3 3A
891215 2 6.0 1 12.0 0 1.0 1.0 2.0 1.0 2.0 1.0 1.0 4 1.0 W 5 5A
891219 1 3.0 1 9.0 0 9.0 5.0 2.0 1.0 1.0 0.0 2.0 5 8.0 W 9 9D
891220 1 1.0 6 12.0 0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 3 8.0 W 6 6B

891221 rows × 17 columns

In [34]:
# I will check each column to see if it's binary, or if there's a value needs to be encoded, or multi level
# 0 and 47 were removed before so I dropped them here
for i in categorical_columns.drop([0,47]):
    print(f"Number of unique values in column {i} is\n {new_azdias[i].value_counts()}\n\n")
Number of unique values in column ANREDE_KZ is
 2    465305
1    425916
Name: ANREDE_KZ, dtype: int64


Number of unique values in column CJT_GESAMTTYP is
 4.0    210963
3.0    156449
6.0    153915
2.0    148795
5.0    117376
1.0     98869
Name: CJT_GESAMTTYP, dtype: int64


Number of unique values in column FINANZTYP is
 6    290367
1    199572
4    130625
2    110867
5    106436
3     53354
Name: FINANZTYP, dtype: int64


Number of unique values in column GFK_URLAUBERTYP is
 12.0    138545
5.0     120126
10.0    109127
8.0      88042
11.0     79740
4.0      63770
9.0      60614
3.0      56007
1.0      53600
2.0      46702
7.0      42956
6.0      27138
Name: GFK_URLAUBERTYP, dtype: int64


Number of unique values in column GREEN_AVANTGARDE is
 0    715996
1    175225
Name: GREEN_AVANTGARDE, dtype: int64


Number of unique values in column LP_FAMILIE_FEIN is
 1.0     426379
10.0    137913
2.0     104305
11.0     51719
8.0      23032
7.0      20730
4.0      12303
5.0      11920
9.0      11148
6.0       9022
3.0       4958
Name: LP_FAMILIE_FEIN, dtype: int64


Number of unique values in column LP_FAMILIE_GROB is
 1.0    426379
5.0    200780
2.0    104305
4.0     52784
3.0     29181
Name: LP_FAMILIE_GROB, dtype: int64


Number of unique values in column LP_STATUS_FEIN is
 1.0     219275
9.0     143238
2.0     118236
10.0    118022
4.0      78317
5.0      74493
3.0      74105
6.0      30914
8.0      19708
7.0      10059
Name: LP_STATUS_FEIN, dtype: int64


Number of unique values in column LP_STATUS_GROB is
 1.0    337511
2.0    226915
4.0    162946
5.0    118022
3.0     40973
Name: LP_STATUS_GROB, dtype: int64


Number of unique values in column NATIONALITAET_KZ is
 1.0    684085
2.0     65418
3.0     33403
Name: NATIONALITAET_KZ, dtype: int64


Number of unique values in column SHOPPER_TYP is
 1.0    254761
2.0    207463
3.0    190219
0.0    127582
Name: SHOPPER_TYP, dtype: int64


Number of unique values in column VERS_TYP is
 2.0    398722
1.0    381303
Name: VERS_TYP, dtype: int64


Number of unique values in column ZABEOTYP is
 3    364905
4    210095
1    123622
5     84956
6     74473
2     33170
Name: ZABEOTYP, dtype: int64


Number of unique values in column GEBAEUDETYP is
 1.0    460465
3.0    178668
8.0    152476
2.0      4935
4.0       900
6.0       628
5.0         1
Name: GEBAEUDETYP, dtype: int64


Number of unique values in column OST_WEST_KZ is
 W    629528
O    168545
Name: OST_WEST_KZ, dtype: int64


Number of unique values in column CAMEO_DEUG_2015 is
 8    134441
9    108177
6    105874
4    103912
3     86779
2     83231
7     77933
5     55310
1     36212
Name: CAMEO_DEUG_2015, dtype: int64


Number of unique values in column CAMEO_DEU_2015 is
 6B    56672
8A    52438
4C    47819
2D    35074
3C    34769
7A    34399
3D    34307
8B    33434
4A    33155
8C    30993
9D    28593
9B    27676
9C    24987
7B    24503
9A    20542
2C    19422
8D    17576
6E    16107
2B    15486
5D    14943
6C    14820
2A    13249
5A    12214
1D    11909
1A    10850
3A    10543
5B    10354
5C     9935
7C     9065
4B     9047
4D     8570
3B     7160
6A     6810
9E     6379
6D     6073
6F     5392
7D     5333
4E     5321
1E     5065
7E     4633
1C     4317
5F     4283
1B     4071
5E     3581
Name: CAMEO_DEU_2015, dtype: int64


In [35]:
#From observation, I can say ANREDE_KZ, GREEN_AVANTGARDE and SOHO_KZ are binary, I will not change anything with them.
#OST_WEST_KZ contains 2 values, W and O, I can make it a dummy variable.
#CAMEO_DEU_2015 will be dropped.
In [36]:
# Re-encode categorical variable(s) to be kept in the analysis.

new_azdias = pd.get_dummies(new_azdias, columns=["OST_WEST_KZ"])
new_azdias = new_azdias.drop(["CAMEO_DEU_2015"], axis=1)

Discussion 1.2.1: Re-Encode Categorical Features

(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding categorical features. Which ones did you keep, which did you drop, and what engineering steps did you perform?)

ANSWER:

I kept most of the features because most of them were numbers and will not affect my analysis negatively. However, I dropped a column called CAMEO_DEU_2015 because it's not feasable to turn it into dummy variable. I turned OST_WEST_KZ into dummy variable, the only column I turned so far.

Step 1.2.2: Engineer Mixed-Type Features

There are a handful of features that are marked as "mixed" in the feature summary that require special treatment in order to be included in the analysis. There are two in particular that deserve attention; the handling of the rest are up to your own choices:

  • "PRAEGENDE_JUGENDJAHRE" combines information on three dimensions: generation by decade, movement (mainstream vs. avantgarde), and nation (east vs. west). While there aren't enough levels to disentangle east from west, you should create two new variables to capture the other two dimensions: an interval-type variable for decade, and a binary variable for movement.
  • "CAMEO_INTL_2015" combines information on two axes: wealth and life stage. Break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).
  • If you decide to keep or engineer new features around the other mixed-type features, make sure you note your steps in the Discussion section.

Be sure to check Data_Dictionary.md for the details needed to finish these tasks.

In [37]:
mixed_columns = feat_info.loc[feat_info["type"] == "mixed"]["attribute"]
mixed_columns
Out[37]:
15      LP_LEBENSPHASE_FEIN
16      LP_LEBENSPHASE_GROB
22    PRAEGENDE_JUGENDJAHRE
56                 WOHNLAGE
59          CAMEO_INTL_2015
64             KBA05_BAUMAX
79              PLZ8_BAUMAX
Name: attribute, dtype: object
In [38]:
# Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.
print(new_azdias["PRAEGENDE_JUGENDJAHRE"].value_counts())

sns.countplot(data = new_azdias, x = "PRAEGENDE_JUGENDJAHRE")
plt.show()
14.0    188697
8.0     145988
5.0      86416
10.0     85808
3.0      55195
15.0     42547
11.0     35752
9.0      33570
6.0      25652
12.0     24446
1.0      21282
4.0      20451
2.0       7479
13.0      5764
7.0       4010
Name: PRAEGENDE_JUGENDJAHRE, dtype: int64
In [39]:
# I decided to write down the attributes in a dictionary, easier to work with
PRAEGENDE_JUGENDJAHRE_DECADE = {1:40, 2:40, 3:50,
                                   4:50, 5:60, 6:60,
                                   7:60, 8:70, 9:70,
                                   10:80, 11:80, 12:80,
                                   13:80, 14:90, 15:90}

PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM = {1:True, 2:False, 3:True, 4:False, 5:True,
                                  6:True, 7:False, 8:True, 9:False, 10:True,
                                  11:False, 12:True, 13:False, 14:True, 15:False}


print(PRAEGENDE_JUGENDJAHRE_DECADE)
print(PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM)
{1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60, 8: 70, 9: 70, 10: 80, 11: 80, 12: 80, 13: 80, 14: 90, 15: 90}
{1: True, 2: False, 3: True, 4: False, 5: True, 6: True, 7: False, 8: True, 9: False, 10: True, 11: False, 12: True, 13: False, 14: True, 15: False}
In [40]:
# Adding the new columns, and dropping PRAEGENDE_JUGENDJAHRE

new_azdias["PRAEGENDE_JUGENDJAHRE_DECADE"] = new_azdias["PRAEGENDE_JUGENDJAHRE"].replace(PRAEGENDE_JUGENDJAHRE_DECADE)


new_azdias["PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM"] = new_azdias["PRAEGENDE_JUGENDJAHRE"].replace(PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM)


# Dropiing the PRAEGENDE_JUGENDJAHRE column
new_azdias = new_azdias.drop(["PRAEGENDE_JUGENDJAHRE"], axis=1)

new_azdias
Out[40]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O OST_WEST_KZ_W PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM
0 2.0 1 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
3 4.0 2 2.0 4 2 5 2 1 2 6 ... 1.0 3.0 4.0 2.0 3.0 3.0 0 1 70.0 True
11 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
12 3.0 1 6.0 5 3 4 2 4 1 3 ... 1.0 5.0 5.0 3.0 6.0 4.0 0 1 70.0 True
13 1.0 2 5.0 1 4 3 5 5 2 1 ... 1.0 3.0 3.0 3.0 6.0 4.0 0 1 90.0 False
14 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
15 4.0 2 4.0 4 1 5 1 1 4 2 ... NaN NaN NaN 4.0 8.0 5.0 0 1 60.0 True
16 1.0 2 1.0 4 3 1 4 5 1 3 ... 1.0 3.0 4.0 1.0 2.0 1.0 0 1 90.0 True
17 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
18 2.0 2 6.0 2 4 1 5 4 1 1 ... 1.0 3.0 3.0 3.0 4.0 3.0 0 1 80.0 True
20 2.0 2 4.0 4 3 1 4 5 1 3 ... NaN NaN NaN 3.0 4.0 1.0 0 1 80.0 False
23 3.0 1 3.0 5 3 3 2 2 1 6 ... NaN NaN NaN 3.0 6.0 2.0 0 1 70.0 False
24 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
25 1.0 1 3.0 3 5 3 5 4 3 4 ... 1.0 4.0 5.0 3.0 2.0 5.0 0 1 90.0 True
26 3.0 1 3.0 5 2 4 2 3 1 3 ... NaN NaN NaN 4.0 3.0 5.0 0 1 70.0 True
29 4.0 2 1.0 5 1 5 2 1 3 6 ... 1.0 3.0 4.0 2.0 4.0 3.0 0 1 60.0 True
30 3.0 2 3.0 4 3 4 4 4 1 3 ... NaN NaN NaN NaN NaN NaN 0 0 70.0 True
34 1.0 2 2.0 4 1 5 1 2 3 2 ... 1.0 3.0 3.0 4.0 7.0 5.0 0 1 50.0 False
35 2.0 2 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
39 4.0 1 4.0 5 3 4 3 1 1 6 ... 1.0 4.0 4.0 3.0 5.0 3.0 0 1 80.0 False
40 4.0 2 1.0 3 2 5 2 1 5 6 ... 1.0 2.0 3.0 3.0 5.0 3.0 0 1 NaN NaN
42 1.0 2 5.0 3 4 2 5 4 2 3 ... 1.0 2.0 3.0 3.0 3.0 1.0 0 1 90.0 True
46 NaN 2 3.0 2 4 3 5 5 4 1 ... NaN NaN NaN NaN NaN NaN 0 0 70.0 True
48 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
53 2.0 1 3.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
54 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
60 2.0 1 6.0 3 2 5 1 1 5 5 ... 4.0 4.0 2.0 3.0 7.0 2.0 0 1 70.0 True
61 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
62 3.0 1 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN 0 0 NaN NaN
64 2.0 2 6.0 1 4 2 4 4 4 1 ... 1.0 4.0 4.0 4.0 6.0 4.0 0 1 90.0 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
891183 3.0 1 3.0 5 1 5 1 1 2 5 ... 1.0 3.0 3.0 4.0 8.0 4.0 0 1 50.0 False
891186 1.0 1 5.0 4 3 1 1 5 1 3 ... 1.0 3.0 3.0 4.0 8.0 3.0 0 1 90.0 False
891188 1.0 2 3.0 4 3 1 4 5 2 4 ... 1.0 2.0 2.0 2.0 1.0 1.0 0 1 90.0 True
891189 2.0 2 3.0 3 2 4 1 3 2 5 ... 1.0 3.0 4.0 5.0 8.0 5.0 0 1 80.0 False
891190 3.0 1 1.0 5 1 5 1 2 2 2 ... 1.0 4.0 4.0 2.0 5.0 3.0 0 1 50.0 False
891191 4.0 2 1.0 4 1 5 1 1 4 2 ... 1.0 4.0 5.0 1.0 3.0 1.0 0 1 50.0 True
891192 1.0 2 3.0 1 5 2 5 3 5 1 ... 1.0 3.0 3.0 3.0 5.0 1.0 0 1 90.0 True
891193 4.0 1 3.0 4 1 5 1 2 5 5 ... 2.0 2.0 2.0 4.0 6.0 4.0 0 1 60.0 True
891194 3.0 1 4.0 4 3 4 2 3 1 3 ... 1.0 4.0 4.0 4.0 8.0 5.0 0 1 80.0 False
891195 4.0 2 6.0 3 1 5 1 1 5 2 ... 1.0 4.0 3.0 4.0 7.0 5.0 0 1 70.0 True
891196 2.0 2 6.0 1 5 2 5 4 3 1 ... 1.0 3.0 3.0 3.0 4.0 5.0 0 1 90.0 True
891197 3.0 2 1.0 3 2 4 1 3 3 5 ... 1.0 4.0 4.0 3.0 9.0 5.0 0 1 80.0 False
891198 3.0 1 5.0 2 5 3 4 5 3 1 ... 1.0 4.0 5.0 3.0 9.0 5.0 0 1 90.0 False
891199 2.0 1 3.0 2 5 2 3 3 4 1 ... 5.0 5.0 3.0 3.0 7.0 3.0 0 1 80.0 True
891200 1.0 2 3.0 1 5 3 5 5 5 1 ... 2.0 3.0 3.0 4.0 6.0 5.0 0 1 90.0 True
891201 3.0 1 3.0 4 2 3 2 2 3 6 ... 3.0 4.0 3.0 3.0 7.0 5.0 0 1 80.0 True
891202 2.0 2 5.0 1 4 1 2 5 4 1 ... 5.0 4.0 2.0 4.0 8.0 5.0 0 1 80.0 True
891204 3.0 1 5.0 4 3 4 4 2 1 6 ... 1.0 3.0 4.0 4.0 7.0 3.0 0 1 80.0 True
891205 4.0 1 2.0 4 2 4 3 3 4 2 ... 3.0 3.0 2.0 3.0 7.0 5.0 0 1 50.0 True
891206 1.0 2 4.0 3 4 2 4 4 2 4 ... 1.0 4.0 5.0 3.0 4.0 3.0 0 1 90.0 True
891207 3.0 2 1.0 5 1 5 1 2 4 2 ... 4.0 4.0 2.0 1.0 5.0 1.0 0 1 40.0 False
891209 1.0 2 5.0 1 4 2 3 4 4 1 ... 5.0 5.0 4.0 4.0 5.0 5.0 0 1 90.0 True
891210 3.0 1 5.0 3 3 4 2 2 2 6 ... 1.0 3.0 3.0 3.0 9.0 5.0 0 1 70.0 True
891211 3.0 1 2.0 3 2 4 3 3 2 2 ... 2.0 3.0 2.0 3.0 5.0 5.0 0 1 70.0 True
891212 4.0 1 1.0 3 1 5 1 1 5 5 ... 5.0 4.0 1.0 3.0 9.0 5.0 0 1 50.0 True
891213 4.0 2 5.0 3 3 3 5 3 2 6 ... 2.0 5.0 5.0 3.0 4.0 4.0 0 1 80.0 False
891214 1.0 2 4.0 1 5 2 3 3 4 1 ... 5.0 5.0 2.0 3.0 7.0 3.0 0 1 90.0 True
891215 2.0 2 6.0 1 5 2 4 5 4 1 ... 2.0 4.0 4.0 2.0 5.0 2.0 0 1 80.0 True
891219 1.0 1 3.0 1 5 3 5 5 5 1 ... 5.0 1.0 1.0 4.0 7.0 5.0 0 1 90.0 True
891220 4.0 1 1.0 4 2 5 2 1 5 6 ... 1.0 4.0 4.0 3.0 4.0 5.0 0 1 50.0 True

891221 rows × 80 columns

In [41]:
# Grabbed the infos from the .md file here, not important


# - 11: Wealthy Households - Pre-Family Couples & Singles
# - 12: Wealthy Households - Young Couples With Children
# - 13: Wealthy Households - Families With School Age Children
# - 14: Wealthy Households - Older Families &  Mature Couples
# - 15: Wealthy Households - Elders In Retirement
# - 21: Prosperous Households - Pre-Family Couples & Singles
# - 22: Prosperous Households - Young Couples With Children
# - 23: Prosperous Households - Families With School Age Children
# - 24: Prosperous Households - Older Families & Mature Couples
# - 25: Prosperous Households - Elders In Retirement
# - 31: Comfortable Households - Pre-Family Couples & Singles
# - 32: Comfortable Households - Young Couples With Children
# - 33: Comfortable Households - Families With School Age Children
# - 34: Comfortable Households - Older Families & Mature Couples
# - 35: Comfortable Households - Elders In Retirement
# - 41: Less Affluent Households - Pre-Family Couples & Singles
# - 42: Less Affluent Households - Young Couples With Children
# - 43: Less Affluent Households - Families With School Age Children
# - 44: Less Affluent Households - Older Families & Mature Couples
# - 45: Less Affluent Households - Elders In Retirement
# - 51: Poorer Households - Pre-Family Couples & Singles
# - 52: Poorer Households - Young Couples With Children
# - 53: Poorer Households - Families With School Age Children
# - 54: Poorer Households - Older Families & Mature Couples
# - 55: Poorer Households - Elders In Retirement
    

# "CAMEO_INTL_2015" combines information on two axes: wealth and life stage. 
# Break up the two-digit codes by their 'tens'-place and 'ones'-place digits into 
# two new ordinal variables (which, for the purposes of this project, is equivalent to 
# just treating them as their raw numeric values).
In [42]:
# Investigate "CAMEO_INTL_2015" and engineer two new variables.
azdias["CAMEO_INTL_2015"].value_counts()

# From the file, it seems that the "tens" digits are connected to wealth, and the "ones" are connected to life stage
Out[42]:
51    133694
41     92336
24     91158
14     62884
43     56672
54     45391
25     39628
22     33155
23     26750
13     26336
45     26132
55     23955
52     20542
31     19024
34     18524
15     16974
44     14820
12     13249
35     10356
32     10354
33      9935
Name: CAMEO_INTL_2015, dtype: int64
In [43]:
# Checking the column values
new_azdias["CAMEO_INTL_2015"]
Out[43]:
0         NaN
3          12
11        NaN
12         43
13         33
14        NaN
15         41
16         41
17        NaN
18         24
20         24
23         43
24        NaN
25         33
26        NaN
29         12
30        NaN
34         25
35        NaN
39        NaN
40        NaN
42         23
46        NaN
48        NaN
53        NaN
54        NaN
60         51
61        NaN
62        NaN
64         52
         ... 
891183     34
891186     45
891188     24
891189     25
891190     15
891191     44
891192     32
891193     43
891194     24
891195     55
891196     23
891197     24
891198     14
891199     51
891200     52
891201     41
891202     41
891204     24
891205     34
891206     25
891207     41
891209     51
891210     41
891211     51
891212     51
891213     34
891214     23
891215     31
891219     51
891220     43
Name: CAMEO_INTL_2015, Length: 891221, dtype: object
In [44]:
CAMEO_INTL_2015_WEALTH = {1.0:1, 2.0:2, 3.0:3,
                          4.0:4, 5.0:5}


CAMEO_INTL_2015_LIFE = {1:1, 2:2,
                        3:3, 4:4,
                        5:5}

print(CAMEO_INTL_2015_WEALTH)
print(CAMEO_INTL_2015_LIFE)
{1.0: 1, 2.0: 2, 3.0: 3, 4.0: 4, 5.0: 5}
{1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
In [45]:
# Adding the new columns, and dropping CAMEO_INTL_2015
new_azdias["CAMEO_INTL_2015"] = new_azdias["CAMEO_INTL_2015"].replace("XX", 10)

new_azdias["CAMEO_INTL_2015_WEALTH"] = np.floor(new_azdias["CAMEO_INTL_2015"].astype(float)/10).replace(CAMEO_INTL_2015_WEALTH)

new_azdias["CAMEO_INTL_2015_LIFE"] = (new_azdias["CAMEO_INTL_2015"].astype(float)%10).replace(CAMEO_INTL_2015_LIFE)

# Dropiing the PRAEGENDE_JUGENDJAHRE column
new_azdias = new_azdias.drop(["CAMEO_INTL_2015"], axis=1)
In [46]:
print(mixed_columns.drop([15,16,22,56,59,64]))
print(categorical_columns.drop([0, 2, 3, 10, 12, 13, 17, 18, 19, 20, 21, 38, 41, 42, 47, 55, 58]))
79    PLZ8_BAUMAX
Name: attribute, dtype: object
52        GEBAEUDETYP
57    CAMEO_DEUG_2015
Name: attribute, dtype: object
In [47]:
# I decided to drop the categorized and mixed columns so they don't interfere with my model

mixed_columns
categorical_columns

# Here I'm dropping the mixed_columns, [15,16,22,56,59,64] are ones I already dropped, so I excluded them

for c in mixed_columns.drop([15,16,22,56,59,64]):
    new_azdias = new_azdias.drop([c], axis=1)
new_azdias
Out[47]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O OST_WEST_KZ_W PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE
0 2.0 1 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
3 4.0 2 2.0 4 2 5 2 1 2 6 ... 4.0 2.0 3.0 3.0 0 1 70.0 True 1.0 2.0
11 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
12 3.0 1 6.0 5 3 4 2 4 1 3 ... 5.0 3.0 6.0 4.0 0 1 70.0 True 4.0 3.0
13 1.0 2 5.0 1 4 3 5 5 2 1 ... 3.0 3.0 6.0 4.0 0 1 90.0 False 3.0 3.0
14 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
15 4.0 2 4.0 4 1 5 1 1 4 2 ... NaN 4.0 8.0 5.0 0 1 60.0 True 4.0 1.0
16 1.0 2 1.0 4 3 1 4 5 1 3 ... 4.0 1.0 2.0 1.0 0 1 90.0 True 4.0 1.0
17 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
18 2.0 2 6.0 2 4 1 5 4 1 1 ... 3.0 3.0 4.0 3.0 0 1 80.0 True 2.0 4.0
20 2.0 2 4.0 4 3 1 4 5 1 3 ... NaN 3.0 4.0 1.0 0 1 80.0 False 2.0 4.0
23 3.0 1 3.0 5 3 3 2 2 1 6 ... NaN 3.0 6.0 2.0 0 1 70.0 False 4.0 3.0
24 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
25 1.0 1 3.0 3 5 3 5 4 3 4 ... 5.0 3.0 2.0 5.0 0 1 90.0 True 3.0 3.0
26 3.0 1 3.0 5 2 4 2 3 1 3 ... NaN 4.0 3.0 5.0 0 1 70.0 True NaN NaN
29 4.0 2 1.0 5 1 5 2 1 3 6 ... 4.0 2.0 4.0 3.0 0 1 60.0 True 1.0 2.0
30 3.0 2 3.0 4 3 4 4 4 1 3 ... NaN NaN NaN NaN 0 0 70.0 True NaN NaN
34 1.0 2 2.0 4 1 5 1 2 3 2 ... 3.0 4.0 7.0 5.0 0 1 50.0 False 2.0 5.0
35 2.0 2 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
39 4.0 1 4.0 5 3 4 3 1 1 6 ... 4.0 3.0 5.0 3.0 0 1 80.0 False NaN NaN
40 4.0 2 1.0 3 2 5 2 1 5 6 ... 3.0 3.0 5.0 3.0 0 1 NaN NaN NaN NaN
42 1.0 2 5.0 3 4 2 5 4 2 3 ... 3.0 3.0 3.0 1.0 0 1 90.0 True 2.0 3.0
46 NaN 2 3.0 2 4 3 5 5 4 1 ... NaN NaN NaN NaN 0 0 70.0 True NaN NaN
48 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
53 2.0 1 3.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
54 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
60 2.0 1 6.0 3 2 5 1 1 5 5 ... 2.0 3.0 7.0 2.0 0 1 70.0 True 5.0 1.0
61 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
62 3.0 1 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
64 2.0 2 6.0 1 4 2 4 4 4 1 ... 4.0 4.0 6.0 4.0 0 1 90.0 True 5.0 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
891183 3.0 1 3.0 5 1 5 1 1 2 5 ... 3.0 4.0 8.0 4.0 0 1 50.0 False 3.0 4.0
891186 1.0 1 5.0 4 3 1 1 5 1 3 ... 3.0 4.0 8.0 3.0 0 1 90.0 False 4.0 5.0
891188 1.0 2 3.0 4 3 1 4 5 2 4 ... 2.0 2.0 1.0 1.0 0 1 90.0 True 2.0 4.0
891189 2.0 2 3.0 3 2 4 1 3 2 5 ... 4.0 5.0 8.0 5.0 0 1 80.0 False 2.0 5.0
891190 3.0 1 1.0 5 1 5 1 2 2 2 ... 4.0 2.0 5.0 3.0 0 1 50.0 False 1.0 5.0
891191 4.0 2 1.0 4 1 5 1 1 4 2 ... 5.0 1.0 3.0 1.0 0 1 50.0 True 4.0 4.0
891192 1.0 2 3.0 1 5 2 5 3 5 1 ... 3.0 3.0 5.0 1.0 0 1 90.0 True 3.0 2.0
891193 4.0 1 3.0 4 1 5 1 2 5 5 ... 2.0 4.0 6.0 4.0 0 1 60.0 True 4.0 3.0
891194 3.0 1 4.0 4 3 4 2 3 1 3 ... 4.0 4.0 8.0 5.0 0 1 80.0 False 2.0 4.0
891195 4.0 2 6.0 3 1 5 1 1 5 2 ... 3.0 4.0 7.0 5.0 0 1 70.0 True 5.0 5.0
891196 2.0 2 6.0 1 5 2 5 4 3 1 ... 3.0 3.0 4.0 5.0 0 1 90.0 True 2.0 3.0
891197 3.0 2 1.0 3 2 4 1 3 3 5 ... 4.0 3.0 9.0 5.0 0 1 80.0 False 2.0 4.0
891198 3.0 1 5.0 2 5 3 4 5 3 1 ... 5.0 3.0 9.0 5.0 0 1 90.0 False 1.0 4.0
891199 2.0 1 3.0 2 5 2 3 3 4 1 ... 3.0 3.0 7.0 3.0 0 1 80.0 True 5.0 1.0
891200 1.0 2 3.0 1 5 3 5 5 5 1 ... 3.0 4.0 6.0 5.0 0 1 90.0 True 5.0 2.0
891201 3.0 1 3.0 4 2 3 2 2 3 6 ... 3.0 3.0 7.0 5.0 0 1 80.0 True 4.0 1.0
891202 2.0 2 5.0 1 4 1 2 5 4 1 ... 2.0 4.0 8.0 5.0 0 1 80.0 True 4.0 1.0
891204 3.0 1 5.0 4 3 4 4 2 1 6 ... 4.0 4.0 7.0 3.0 0 1 80.0 True 2.0 4.0
891205 4.0 1 2.0 4 2 4 3 3 4 2 ... 2.0 3.0 7.0 5.0 0 1 50.0 True 3.0 4.0
891206 1.0 2 4.0 3 4 2 4 4 2 4 ... 5.0 3.0 4.0 3.0 0 1 90.0 True 2.0 5.0
891207 3.0 2 1.0 5 1 5 1 2 4 2 ... 2.0 1.0 5.0 1.0 0 1 40.0 False 4.0 1.0
891209 1.0 2 5.0 1 4 2 3 4 4 1 ... 4.0 4.0 5.0 5.0 0 1 90.0 True 5.0 1.0
891210 3.0 1 5.0 3 3 4 2 2 2 6 ... 3.0 3.0 9.0 5.0 0 1 70.0 True 4.0 1.0
891211 3.0 1 2.0 3 2 4 3 3 2 2 ... 2.0 3.0 5.0 5.0 0 1 70.0 True 5.0 1.0
891212 4.0 1 1.0 3 1 5 1 1 5 5 ... 1.0 3.0 9.0 5.0 0 1 50.0 True 5.0 1.0
891213 4.0 2 5.0 3 3 3 5 3 2 6 ... 5.0 3.0 4.0 4.0 0 1 80.0 False 3.0 4.0
891214 1.0 2 4.0 1 5 2 3 3 4 1 ... 2.0 3.0 7.0 3.0 0 1 90.0 True 2.0 3.0
891215 2.0 2 6.0 1 5 2 4 5 4 1 ... 4.0 2.0 5.0 2.0 0 1 80.0 True 3.0 1.0
891219 1.0 1 3.0 1 5 3 5 5 5 1 ... 1.0 4.0 7.0 5.0 0 1 90.0 True 5.0 1.0
891220 4.0 1 1.0 4 2 5 2 1 5 6 ... 4.0 3.0 4.0 5.0 0 1 50.0 True 4.0 3.0

891221 rows × 80 columns

In [48]:
# Here I'm dropping the categorical_columns, [0, 2, 3, 10, 12, 13, 17, 18, 19, 20, 21, 38, 41, 42, 47, 55, 58]
# are ones I already dropped, so I excluded them

for c in categorical_columns.drop([0, 2, 3, 10, 12, 13, 17, 18, 19, 20, 21, 38, 41, 42, 47, 55, 58]):
    new_azdias = new_azdias.drop([c], axis=1)
new_azdias
Out[48]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O OST_WEST_KZ_W PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE
0 2.0 1 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
3 4.0 2 2.0 4 2 5 2 1 2 6 ... 4.0 2.0 3.0 3.0 0 1 70.0 True 1.0 2.0
11 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
12 3.0 1 6.0 5 3 4 2 4 1 3 ... 5.0 3.0 6.0 4.0 0 1 70.0 True 4.0 3.0
13 1.0 2 5.0 1 4 3 5 5 2 1 ... 3.0 3.0 6.0 4.0 0 1 90.0 False 3.0 3.0
14 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
15 4.0 2 4.0 4 1 5 1 1 4 2 ... NaN 4.0 8.0 5.0 0 1 60.0 True 4.0 1.0
16 1.0 2 1.0 4 3 1 4 5 1 3 ... 4.0 1.0 2.0 1.0 0 1 90.0 True 4.0 1.0
17 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
18 2.0 2 6.0 2 4 1 5 4 1 1 ... 3.0 3.0 4.0 3.0 0 1 80.0 True 2.0 4.0
20 2.0 2 4.0 4 3 1 4 5 1 3 ... NaN 3.0 4.0 1.0 0 1 80.0 False 2.0 4.0
23 3.0 1 3.0 5 3 3 2 2 1 6 ... NaN 3.0 6.0 2.0 0 1 70.0 False 4.0 3.0
24 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
25 1.0 1 3.0 3 5 3 5 4 3 4 ... 5.0 3.0 2.0 5.0 0 1 90.0 True 3.0 3.0
26 3.0 1 3.0 5 2 4 2 3 1 3 ... NaN 4.0 3.0 5.0 0 1 70.0 True NaN NaN
29 4.0 2 1.0 5 1 5 2 1 3 6 ... 4.0 2.0 4.0 3.0 0 1 60.0 True 1.0 2.0
30 3.0 2 3.0 4 3 4 4 4 1 3 ... NaN NaN NaN NaN 0 0 70.0 True NaN NaN
34 1.0 2 2.0 4 1 5 1 2 3 2 ... 3.0 4.0 7.0 5.0 0 1 50.0 False 2.0 5.0
35 2.0 2 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
39 4.0 1 4.0 5 3 4 3 1 1 6 ... 4.0 3.0 5.0 3.0 0 1 80.0 False NaN NaN
40 4.0 2 1.0 3 2 5 2 1 5 6 ... 3.0 3.0 5.0 3.0 0 1 NaN NaN NaN NaN
42 1.0 2 5.0 3 4 2 5 4 2 3 ... 3.0 3.0 3.0 1.0 0 1 90.0 True 2.0 3.0
46 NaN 2 3.0 2 4 3 5 5 4 1 ... NaN NaN NaN NaN 0 0 70.0 True NaN NaN
48 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
53 2.0 1 3.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
54 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
60 2.0 1 6.0 3 2 5 1 1 5 5 ... 2.0 3.0 7.0 2.0 0 1 70.0 True 5.0 1.0
61 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
62 3.0 1 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN 0 0 NaN NaN NaN NaN
64 2.0 2 6.0 1 4 2 4 4 4 1 ... 4.0 4.0 6.0 4.0 0 1 90.0 True 5.0 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
891183 3.0 1 3.0 5 1 5 1 1 2 5 ... 3.0 4.0 8.0 4.0 0 1 50.0 False 3.0 4.0
891186 1.0 1 5.0 4 3 1 1 5 1 3 ... 3.0 4.0 8.0 3.0 0 1 90.0 False 4.0 5.0
891188 1.0 2 3.0 4 3 1 4 5 2 4 ... 2.0 2.0 1.0 1.0 0 1 90.0 True 2.0 4.0
891189 2.0 2 3.0 3 2 4 1 3 2 5 ... 4.0 5.0 8.0 5.0 0 1 80.0 False 2.0 5.0
891190 3.0 1 1.0 5 1 5 1 2 2 2 ... 4.0 2.0 5.0 3.0 0 1 50.0 False 1.0 5.0
891191 4.0 2 1.0 4 1 5 1 1 4 2 ... 5.0 1.0 3.0 1.0 0 1 50.0 True 4.0 4.0
891192 1.0 2 3.0 1 5 2 5 3 5 1 ... 3.0 3.0 5.0 1.0 0 1 90.0 True 3.0 2.0
891193 4.0 1 3.0 4 1 5 1 2 5 5 ... 2.0 4.0 6.0 4.0 0 1 60.0 True 4.0 3.0
891194 3.0 1 4.0 4 3 4 2 3 1 3 ... 4.0 4.0 8.0 5.0 0 1 80.0 False 2.0 4.0
891195 4.0 2 6.0 3 1 5 1 1 5 2 ... 3.0 4.0 7.0 5.0 0 1 70.0 True 5.0 5.0
891196 2.0 2 6.0 1 5 2 5 4 3 1 ... 3.0 3.0 4.0 5.0 0 1 90.0 True 2.0 3.0
891197 3.0 2 1.0 3 2 4 1 3 3 5 ... 4.0 3.0 9.0 5.0 0 1 80.0 False 2.0 4.0
891198 3.0 1 5.0 2 5 3 4 5 3 1 ... 5.0 3.0 9.0 5.0 0 1 90.0 False 1.0 4.0
891199 2.0 1 3.0 2 5 2 3 3 4 1 ... 3.0 3.0 7.0 3.0 0 1 80.0 True 5.0 1.0
891200 1.0 2 3.0 1 5 3 5 5 5 1 ... 3.0 4.0 6.0 5.0 0 1 90.0 True 5.0 2.0
891201 3.0 1 3.0 4 2 3 2 2 3 6 ... 3.0 3.0 7.0 5.0 0 1 80.0 True 4.0 1.0
891202 2.0 2 5.0 1 4 1 2 5 4 1 ... 2.0 4.0 8.0 5.0 0 1 80.0 True 4.0 1.0
891204 3.0 1 5.0 4 3 4 4 2 1 6 ... 4.0 4.0 7.0 3.0 0 1 80.0 True 2.0 4.0
891205 4.0 1 2.0 4 2 4 3 3 4 2 ... 2.0 3.0 7.0 5.0 0 1 50.0 True 3.0 4.0
891206 1.0 2 4.0 3 4 2 4 4 2 4 ... 5.0 3.0 4.0 3.0 0 1 90.0 True 2.0 5.0
891207 3.0 2 1.0 5 1 5 1 2 4 2 ... 2.0 1.0 5.0 1.0 0 1 40.0 False 4.0 1.0
891209 1.0 2 5.0 1 4 2 3 4 4 1 ... 4.0 4.0 5.0 5.0 0 1 90.0 True 5.0 1.0
891210 3.0 1 5.0 3 3 4 2 2 2 6 ... 3.0 3.0 9.0 5.0 0 1 70.0 True 4.0 1.0
891211 3.0 1 2.0 3 2 4 3 3 2 2 ... 2.0 3.0 5.0 5.0 0 1 70.0 True 5.0 1.0
891212 4.0 1 1.0 3 1 5 1 1 5 5 ... 1.0 3.0 9.0 5.0 0 1 50.0 True 5.0 1.0
891213 4.0 2 5.0 3 3 3 5 3 2 6 ... 5.0 3.0 4.0 4.0 0 1 80.0 False 3.0 4.0
891214 1.0 2 4.0 1 5 2 3 3 4 1 ... 2.0 3.0 7.0 3.0 0 1 90.0 True 2.0 3.0
891215 2.0 2 6.0 1 5 2 4 5 4 1 ... 4.0 2.0 5.0 2.0 0 1 80.0 True 3.0 1.0
891219 1.0 1 3.0 1 5 3 5 5 5 1 ... 1.0 4.0 7.0 5.0 0 1 90.0 True 5.0 1.0
891220 4.0 1 1.0 4 2 5 2 1 5 6 ... 4.0 3.0 4.0 5.0 0 1 50.0 True 4.0 3.0

891221 rows × 78 columns

Discussion 1.2.2: Engineer Mixed-Type Features

(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding mixed-value features. Which ones did you keep, which did you drop, and what engineering steps did you perform?)

ANSWER:

I changed both PRAEGENDE_JUGENDJAHRE and CAMEO_INTL, unmixed the values given the information from the .md file

I dropped the rest of mixed/categorical data.

Step 1.2.3: Complete Feature Selection

In order to finish this step up, you need to make sure that your data frame now only has the columns that you want to keep. To summarize, the dataframe should consist of the following:

  • All numeric, interval, and ordinal type columns from the original dataset.
  • Binary categorical features (all numerically-encoded).
  • Engineered features from other multi-level categorical features and mixed features.

Make sure that for any new columns that you have engineered, that you've excluded the original columns from the final dataset. Otherwise, their values will interfere with the analysis later on the project. For example, you should not keep "PRAEGENDE_JUGENDJAHRE", since its values won't be useful for the algorithm: only the values derived from it in the engineered features you created should be retained. As a reminder, your data should only be from the subset with few or no missing values.

In [49]:
# If there are other re-engineering tasks you need to perform, make sure you
# take care of them here. (Dealing with missing data will come in step 2.1.)
In [50]:
# Do whatever you need to in order to ensure that the dataframe only contains
# the columns that should be passed to the algorithm functions.

Step 1.3: Create a Cleaning Function

Even though you've finished cleaning up the general population demographics data, it's important to look ahead to the future and realize that you'll need to perform the same cleaning steps on the customer demographics data. In this substep, complete the function below to execute the main feature selection, encoding, and re-engineering steps you performed above. Then, when it comes to looking at the customer data in Step 3, you can just run this function on that DataFrame to get the trimmed dataset in a single step.

In [51]:
def clean_data(df):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data
    
    INPUT: Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """
    
    # Put in code here to execute all main cleaning steps:
    # convert missing value codes into NaNs, ...
    total = 0
    
    # Getting the column name to use in azdias DataFrame
        # For each value I want to replace, given from "converted" above
        
    for i in range(len(feat_info['missing_or_unknown'])):
        converted = convert(feat_info['missing_or_unknown'][i])

        for value in converted:
            try:
                value = int(value)
            except:
                None
            df[feat_info['attribute'][i]] = df[feat_info['attribute'][i]].replace(value, np.NaN)

            
            
        
    total = 0    
    df_dic_attr_NaN = {}
    
    for column in feat_info['attribute']:
        n_NaN = df.loc[df[column].astype(str) == "nan"][column].isna().count()
        total += n_NaN
        dic_attr_NaN[column] = n_NaN
        
        
        
            
    for i in most_NaN.keys():
        df = df.drop([i], axis=1)
    
    
    # remove selected columns and rows, ...

    df_NaN_appeared = df[df.isna().any(axis=1)]

    df_NaN_missing = df[df.isna().any(axis=1) == False]
    
    df_NaN_appeared = df_NaN_appeared.dropna(thresh=20)

    frames = [df_NaN_appeared, df_NaN_missing]
    
    df = pd.concat(frames)
        
    # select, re-encode, and engineer column values.

    df = pd.get_dummies(df, columns=["OST_WEST_KZ"])
    df = df.drop(["CAMEO_DEU_2015"], axis=1)
    
    
    df["PRAEGENDE_JUGENDJAHRE_DECADE"] = df["PRAEGENDE_JUGENDJAHRE"].replace(PRAEGENDE_JUGENDJAHRE_DECADE)

    df["PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM"] = df["PRAEGENDE_JUGENDJAHRE"].replace(PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM)

    df = df.drop(["PRAEGENDE_JUGENDJAHRE"], axis=1)
    

    df["CAMEO_INTL_2015_WEALTH"] = np.floor(df["CAMEO_INTL_2015"].astype(float)/10).replace(CAMEO_INTL_2015_WEALTH)

    df["CAMEO_INTL_2015_LIFE"] = (df["CAMEO_INTL_2015"].astype(float)%10).replace(CAMEO_INTL_2015_LIFE)
    
    df = df.drop(["CAMEO_INTL_2015"], axis=1)
    
    for c in mixed_columns.drop([15,16,22,56,59,64]):
        df = df.drop([c], axis=1)

    for c in categorical_columns.drop([0, 2, 3, 10, 12, 13, 17, 18, 19, 20, 21, 38, 41, 42, 47, 55, 58]):
        df = df.drop([c], axis=1)
    
    # Return the cleaned dataframe.
    
    return df

Step 2: Feature Transformation

Step 2.1: Apply Feature Scaling

Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale for features. Starting from this part of the project, you'll want to keep an eye on the API reference page for sklearn to help you navigate to all of the classes and functions that you'll need. In this substep, you'll need to check the following:

  • sklearn requires that data not have missing values in order for its estimators to work properly. So, before applying the scaler to your data, make sure that you've cleaned the DataFrame of the remaining missing values. This can be as simple as just removing all data points with missing data, or applying an Imputer to replace all missing values. You might also try a more complicated procedure where you temporarily remove missing values in order to compute the scaling parameters before re-introducing those missing values and applying imputation. Think about how much missing data you have and what possible effects each approach might have on your analysis, and justify your decision in the discussion section below.
  • For the actual scaling function, a StandardScaler instance is suggested, scaling each feature to mean 0 and standard deviation 1.
  • For these classes, you can make use of the .fit_transform() method to both fit a procedure to the data as well as apply the transformation to the data at the same time. Don't forget to keep the fit sklearn objects handy, since you'll be applying them to the customer demographics data towards the end of the project.
In [52]:
# If you've not yet cleaned the dataset of all NaN values, then investigate and
# do that now.
new_azdias = new_azdias.dropna(thresh=78)
new_azdias

# number of rows are 623209 now..
Out[52]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O OST_WEST_KZ_W PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE
1 1.0 2 5.0 1 5 2 5 4 5 1 ... 4.0 3.0 5.0 4.0 0 1 90.0 True 5.0 1.0
2 3.0 2 3.0 1 4 1 2 3 5 1 ... 4.0 3.0 5.0 2.0 0 1 90.0 False 2.0 4.0
4 3.0 1 5.0 4 3 4 1 3 2 5 ... 3.0 4.0 6.0 5.0 0 1 70.0 True 4.0 3.0
5 1.0 2 2.0 3 1 5 2 2 5 2 ... 5.0 2.0 3.0 3.0 0 1 50.0 True 5.0 4.0
6 2.0 2 5.0 1 5 1 5 4 3 4 ... 5.0 4.0 6.0 3.0 0 1 80.0 True 2.0 2.0
7 1.0 1 3.0 3 3 4 1 3 2 5 ... 4.0 2.0 5.0 2.0 0 1 70.0 True 1.0 4.0
8 3.0 1 3.0 4 4 2 4 2 2 6 ... 3.0 2.0 4.0 3.0 0 1 80.0 False 1.0 3.0
9 3.0 2 4.0 2 4 2 3 5 4 1 ... 3.0 2.0 3.0 1.0 0 1 90.0 False 1.0 5.0
10 3.0 2 1.0 2 2 5 3 1 5 6 ... 3.0 4.0 6.0 5.0 0 1 50.0 True 5.0 1.0
19 3.0 1 3.0 5 2 3 1 3 1 5 ... 4.0 4.0 6.0 3.0 0 1 70.0 False 3.0 4.0
21 2.0 1 3.0 3 4 1 2 5 1 3 ... 3.0 5.0 7.0 5.0 0 1 80.0 True 5.0 5.0
22 1.0 1 4.0 1 5 3 5 5 5 1 ... 3.0 4.0 5.0 5.0 0 1 90.0 True 5.0 1.0
27 3.0 1 4.0 3 3 4 1 2 2 5 ... 3.0 4.0 7.0 5.0 0 1 80.0 False 5.0 1.0
28 3.0 1 2.0 3 2 4 3 3 2 3 ... 5.0 3.0 4.0 1.0 0 1 80.0 True 1.0 3.0
31 2.0 2 6.0 1 5 2 3 5 4 1 ... 2.0 4.0 8.0 5.0 0 1 80.0 True 5.0 1.0
32 1.0 1 4.0 5 4 2 3 4 1 3 ... 4.0 4.0 3.0 5.0 0 1 90.0 False 1.0 4.0
33 2.0 2 4.0 2 2 4 1 3 4 5 ... 3.0 3.0 8.0 5.0 0 1 70.0 True 4.0 1.0
36 3.0 2 2.0 2 3 4 3 2 3 6 ... 3.0 4.0 9.0 4.0 0 1 70.0 True 4.0 1.0
37 4.0 1 2.0 5 1 5 1 1 3 6 ... 4.0 3.0 5.0 2.0 0 1 70.0 True 2.0 4.0
38 2.0 2 6.0 3 3 1 4 4 2 4 ... 4.0 3.0 5.0 2.0 0 1 80.0 True 2.0 5.0
41 1.0 2 3.0 2 5 3 5 5 2 1 ... 5.0 3.0 5.0 2.0 0 1 90.0 False 2.0 5.0
43 1.0 2 4.0 1 5 3 5 5 5 1 ... 4.0 3.0 7.0 2.0 0 1 90.0 True 2.0 5.0
44 3.0 2 4.0 3 2 4 4 2 2 6 ... 3.0 1.0 1.0 1.0 0 1 70.0 True 2.0 4.0
45 4.0 2 2.0 5 1 5 2 1 2 2 ... 5.0 3.0 2.0 3.0 0 1 60.0 True 4.0 1.0
47 4.0 2 2.0 4 1 5 1 2 3 5 ... 3.0 3.0 5.0 3.0 0 1 50.0 False 2.0 5.0
49 2.0 1 6.0 4 5 1 4 4 2 4 ... 5.0 3.0 4.0 2.0 0 1 80.0 True 4.0 3.0
50 3.0 1 6.0 2 5 2 4 5 2 1 ... 4.0 2.0 1.0 1.0 0 1 90.0 True 4.0 3.0
51 1.0 1 5.0 2 5 3 5 5 2 1 ... 4.0 3.0 3.0 3.0 0 1 90.0 False 3.0 1.0
52 1.0 2 1.0 1 4 2 2 4 3 1 ... 3.0 4.0 9.0 3.0 0 1 90.0 True 4.0 1.0
55 3.0 2 5.0 2 3 4 2 2 3 1 ... 4.0 3.0 8.0 5.0 0 1 70.0 True 5.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
891183 3.0 1 3.0 5 1 5 1 1 2 5 ... 3.0 4.0 8.0 4.0 0 1 50.0 False 3.0 4.0
891186 1.0 1 5.0 4 3 1 1 5 1 3 ... 3.0 4.0 8.0 3.0 0 1 90.0 False 4.0 5.0
891188 1.0 2 3.0 4 3 1 4 5 2 4 ... 2.0 2.0 1.0 1.0 0 1 90.0 True 2.0 4.0
891189 2.0 2 3.0 3 2 4 1 3 2 5 ... 4.0 5.0 8.0 5.0 0 1 80.0 False 2.0 5.0
891190 3.0 1 1.0 5 1 5 1 2 2 2 ... 4.0 2.0 5.0 3.0 0 1 50.0 False 1.0 5.0
891191 4.0 2 1.0 4 1 5 1 1 4 2 ... 5.0 1.0 3.0 1.0 0 1 50.0 True 4.0 4.0
891192 1.0 2 3.0 1 5 2 5 3 5 1 ... 3.0 3.0 5.0 1.0 0 1 90.0 True 3.0 2.0
891193 4.0 1 3.0 4 1 5 1 2 5 5 ... 2.0 4.0 6.0 4.0 0 1 60.0 True 4.0 3.0
891194 3.0 1 4.0 4 3 4 2 3 1 3 ... 4.0 4.0 8.0 5.0 0 1 80.0 False 2.0 4.0
891195 4.0 2 6.0 3 1 5 1 1 5 2 ... 3.0 4.0 7.0 5.0 0 1 70.0 True 5.0 5.0
891196 2.0 2 6.0 1 5 2 5 4 3 1 ... 3.0 3.0 4.0 5.0 0 1 90.0 True 2.0 3.0
891197 3.0 2 1.0 3 2 4 1 3 3 5 ... 4.0 3.0 9.0 5.0 0 1 80.0 False 2.0 4.0
891198 3.0 1 5.0 2 5 3 4 5 3 1 ... 5.0 3.0 9.0 5.0 0 1 90.0 False 1.0 4.0
891199 2.0 1 3.0 2 5 2 3 3 4 1 ... 3.0 3.0 7.0 3.0 0 1 80.0 True 5.0 1.0
891200 1.0 2 3.0 1 5 3 5 5 5 1 ... 3.0 4.0 6.0 5.0 0 1 90.0 True 5.0 2.0
891201 3.0 1 3.0 4 2 3 2 2 3 6 ... 3.0 3.0 7.0 5.0 0 1 80.0 True 4.0 1.0
891202 2.0 2 5.0 1 4 1 2 5 4 1 ... 2.0 4.0 8.0 5.0 0 1 80.0 True 4.0 1.0
891204 3.0 1 5.0 4 3 4 4 2 1 6 ... 4.0 4.0 7.0 3.0 0 1 80.0 True 2.0 4.0
891205 4.0 1 2.0 4 2 4 3 3 4 2 ... 2.0 3.0 7.0 5.0 0 1 50.0 True 3.0 4.0
891206 1.0 2 4.0 3 4 2 4 4 2 4 ... 5.0 3.0 4.0 3.0 0 1 90.0 True 2.0 5.0
891207 3.0 2 1.0 5 1 5 1 2 4 2 ... 2.0 1.0 5.0 1.0 0 1 40.0 False 4.0 1.0
891209 1.0 2 5.0 1 4 2 3 4 4 1 ... 4.0 4.0 5.0 5.0 0 1 90.0 True 5.0 1.0
891210 3.0 1 5.0 3 3 4 2 2 2 6 ... 3.0 3.0 9.0 5.0 0 1 70.0 True 4.0 1.0
891211 3.0 1 2.0 3 2 4 3 3 2 2 ... 2.0 3.0 5.0 5.0 0 1 70.0 True 5.0 1.0
891212 4.0 1 1.0 3 1 5 1 1 5 5 ... 1.0 3.0 9.0 5.0 0 1 50.0 True 5.0 1.0
891213 4.0 2 5.0 3 3 3 5 3 2 6 ... 5.0 3.0 4.0 4.0 0 1 80.0 False 3.0 4.0
891214 1.0 2 4.0 1 5 2 3 3 4 1 ... 2.0 3.0 7.0 3.0 0 1 90.0 True 2.0 3.0
891215 2.0 2 6.0 1 5 2 4 5 4 1 ... 4.0 2.0 5.0 2.0 0 1 80.0 True 3.0 1.0
891219 1.0 1 3.0 1 5 3 5 5 5 1 ... 1.0 4.0 7.0 5.0 0 1 90.0 True 5.0 1.0
891220 4.0 1 1.0 4 2 5 2 1 5 6 ... 4.0 3.0 4.0 5.0 0 1 50.0 True 4.0 3.0

623209 rows × 78 columns

In [53]:
new_azdias.isna().sum().sum()
Out[53]:
0
In [89]:
# Apply feature scaling to the general population demographics data.
from sklearn.preprocessing import StandardScaler

scaler_fit = StandardScaler().fit(new_azdias.astype(float))
scaler = scaler_fit.transform(new_azdias.astype(float))
print(scaler)
[[-1.7462911   0.97782476  1.0051361  ...,  0.4969323   1.14788172
  -1.25110753]
 [ 0.20210748  0.97782476 -0.30930461 ..., -2.01234653 -0.90999737
   0.74982182]
 [ 0.20210748 -1.02267813  1.0051361  ...,  0.4969323   0.46192202
   0.08284537]
 ..., 
 [-0.77209181  0.97782476  1.66235646 ...,  0.4969323  -0.22403767
  -1.25110753]
 [-1.7462911  -1.02267813 -0.30930461 ...,  0.4969323   1.14788172
  -1.25110753]
 [ 1.17630677 -1.02267813 -1.62374532 ...,  0.4969323   0.46192202
   0.08284537]]

Discussion 2.1: Apply Feature Scaling

(Double-click this cell and replace this text with your own text, reporting your decisions regarding feature scaling.)

Answer:

I removed the rows with NaN values. Reason: I removed those rows mainly to reduce bias
I fit_transform the data using StandardScaler from sklearn.

Step 2.2: Perform Dimensionality Reduction

On your scaled data, you are now ready to apply dimensionality reduction techniques.

  • Use sklearn's PCA class to apply principal component analysis on the data, thus finding the vectors of maximal variance in the data. To start, you should not set any parameters (so all components are computed) or set a number of components that is at least half the number of features (so there's enough features to see the general trend in variability).
  • Check out the ratio of variance explained by each principal component as well as the cumulative variance explained. Try plotting the cumulative or sequential values using matplotlib's plot() function. Based on what you find, select a value for the number of transformed features you'll retain for the clustering part of the project.
  • Once you've made a choice for the number of components to keep, make sure you re-fit a PCA instance to perform the decided-on transformation.
In [55]:
def doPCA(data, n_components=False):
    X = StandardScaler().fit_transform(data)
    pca = None
    print("pcanone")
    if not n_components:
        print("if not")
        pca = PCA()
    else:
        print("else")
        pca = PCA(n_components)

    X_pca = pca.fit_transform(X)
    print("fit")
    return pca, X_pca
In [56]:
def pca_results(full_dataset, pca):
    
    dimensions = dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]

    components = pd.DataFrame(np.round(pca.components_, 4), columns = full_dataset.keys())
    components.index = dimensions

    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
    variance_ratios.index = dimensions

    fig, ax = plt.subplots(figsize = (14,8))

    components.plot(ax = ax, kind = 'bar');
    ax.set_ylabel("Feature Weights")
    ax.set_xticklabels(dimensions, rotation=0)
    
    for i, ev in enumerate(pca.explained_variance_ratio_):
        ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n          %.4f"%(ev))

    return pd.concat([variance_ratios, components], axis = 1)
In [57]:
# Apply PCA to the data.
from sklearn.decomposition import PCA
pca, X_pca = doPCA(new_azdias)

print(pca.explained_variance_ratio_)
pcanone
if not
fit
[  1.72492997e-01   1.20388201e-01   8.19630269e-02   5.89165951e-02
   4.56787101e-02   3.38481297e-02   2.77662614e-02   2.47845154e-02
   2.21153050e-02   2.00876380e-02   1.90525083e-02   1.72394352e-02
   1.57180121e-02   1.45643565e-02   1.43289467e-02   1.34985990e-02
   1.28373240e-02   1.27250896e-02   1.20491892e-02   1.18071983e-02
   1.16004812e-02   1.09780799e-02   1.06516614e-02   1.03129386e-02
   1.00024073e-02   9.85854602e-03   9.63427198e-03   9.17738339e-03
   8.82009977e-03   8.64777429e-03   8.21483643e-03   7.96451596e-03
   7.12601715e-03   6.42037721e-03   6.10271577e-03   5.95235192e-03
   5.94659495e-03   5.79670708e-03   5.45377758e-03   5.15379562e-03
   5.06407833e-03   4.84148659e-03   4.68476309e-03   4.15297408e-03
   3.94284404e-03   3.81899262e-03   3.58565907e-03   3.39209644e-03
   3.29371021e-03   3.13794936e-03   3.09454663e-03   2.91073389e-03
   2.87309024e-03   2.67965013e-03   2.61256009e-03   2.57466408e-03
   2.41851153e-03   2.29526767e-03   2.17291521e-03   2.06753163e-03
   2.01139329e-03   1.85621200e-03   1.82123293e-03   1.76853815e-03
   1.72652085e-03   1.57252762e-03   1.34683142e-03   1.32996462e-03
   1.13174608e-03   1.06412697e-03   1.02785878e-03   9.75149490e-04
   4.93018771e-04   3.76341441e-04   9.01059915e-05   8.53328834e-05
   3.17017502e-05   4.70519124e-30]
In [58]:
# Investigate the variance accounted for by each principal component.

pca_results(new_azdias, pca)
Out[58]:
Explained Variance ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O OST_WEST_KZ_W PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE
Dimension 1 0.1725 -0.1087 0.0092 0.0387 -0.2066 0.1299 -0.1024 0.0639 0.0660 0.1531 ... -0.1452 0.1243 0.1649 0.1116 0.0571 -0.0571 0.0908 0.0991 0.1833 -0.1118
Dimension 2 0.1204 0.2558 0.0780 -0.1224 0.0724 -0.2327 0.2225 -0.1986 -0.2237 0.1044 ... -0.0615 0.0586 0.0732 0.0521 0.0328 -0.0328 -0.2466 0.0390 0.0594 0.0038
Dimension 3 0.0820 0.0666 -0.3629 -0.0308 0.1476 -0.0882 0.0834 -0.1736 -0.0869 -0.0472 ... -0.0408 0.0382 0.0499 0.0318 0.0223 -0.0223 -0.0950 -0.0337 0.0285 -0.0123
Dimension 4 0.0589 0.0017 0.0243 0.0667 0.0195 -0.0078 0.0048 -0.0622 -0.0011 -0.0250 ... -0.1031 0.1127 0.2127 0.1182 0.0062 -0.0062 0.0130 -0.1603 0.0173 -0.0090
Dimension 5 0.0457 0.0506 -0.0242 -0.0185 -0.0298 -0.0083 0.0034 0.1267 -0.1276 0.0885 ... -0.1248 0.0301 -0.1699 -0.0525 0.2298 -0.2298 -0.0264 0.2702 0.1271 -0.0494
Dimension 6 0.0338 -0.0428 0.0359 0.0201 0.0753 0.0044 -0.1310 0.0539 -0.0588 -0.0439 ... -0.2657 0.2162 0.0715 0.0572 0.3976 -0.3976 0.0343 -0.1435 -0.0624 0.0853
Dimension 7 0.0278 -0.0025 0.0263 -0.0308 -0.0614 -0.0001 -0.0385 0.0162 -0.0454 0.1185 ... 0.1311 -0.1598 -0.1081 -0.2381 0.2081 -0.2081 -0.0130 -0.0472 -0.0635 0.0647
Dimension 8 0.0248 -0.0142 -0.0147 0.0164 0.0245 0.0183 0.0353 -0.0191 0.0602 -0.1200 ... -0.1914 -0.1810 0.0372 -0.0639 -0.2558 0.2558 0.0312 0.0247 -0.0385 -0.0471
Dimension 9 0.0221 0.0108 0.0569 -0.1030 -0.0845 -0.0471 0.0873 -0.1009 0.0420 0.1779 ... -0.2247 -0.1511 -0.0965 -0.0293 -0.1537 0.1537 -0.0971 0.0057 -0.0124 0.0187
Dimension 10 0.0201 -0.1142 0.0788 -0.0858 0.1399 -0.1787 0.1492 -0.1181 -0.2019 -0.0263 ... 0.0913 0.0934 0.0670 0.0466 0.0067 -0.0067 -0.1819 0.1111 0.0158 0.0071
Dimension 11 0.0191 -0.0693 -0.0402 0.0399 0.1267 -0.0776 0.0674 -0.1106 -0.1636 -0.1761 ... -0.0833 -0.1476 -0.0989 -0.0721 -0.0669 0.0669 -0.0097 0.0951 0.0399 -0.1123
Dimension 12 0.0172 -0.1129 -0.0080 -0.0772 -0.1155 -0.0455 0.1219 -0.0561 -0.0645 0.2548 ... 0.0959 0.1341 0.0979 0.0581 0.0557 -0.0557 -0.0854 0.0422 -0.1813 0.3359
Dimension 13 0.0157 -0.0187 0.0054 -0.0683 0.0015 0.0123 -0.0135 0.0545 0.0136 0.0104 ... 0.0373 0.1467 -0.0340 0.1869 0.0047 -0.0047 -0.0003 -0.0519 0.0705 -0.0072
Dimension 14 0.0146 0.1356 -0.0613 0.3824 0.0381 -0.0121 -0.0394 -0.0753 -0.0599 -0.1529 ... 0.0838 0.0889 0.0748 0.0870 -0.0532 0.0532 0.0600 0.1150 -0.0512 0.1191
Dimension 15 0.0143 -0.0140 0.0083 0.0781 -0.0134 -0.0272 0.0182 -0.0164 -0.0334 0.1027 ... -0.0317 -0.2461 0.0847 -0.4217 0.1085 -0.1085 -0.0239 0.1722 -0.1490 0.0864
Dimension 16 0.0135 -0.0167 0.0016 0.0109 0.0019 0.0260 -0.0398 0.0558 0.0212 -0.1088 ... -0.0782 -0.2388 -0.0319 -0.2976 -0.0080 0.0080 0.0274 -0.2005 -0.0623 0.1064
Dimension 17 0.0128 -0.0250 0.0229 0.0831 0.0214 -0.0524 0.0797 -0.0194 -0.0660 0.0423 ... -0.0181 -0.0172 -0.0170 -0.0035 -0.0146 0.0146 -0.0499 -0.0252 0.0200 -0.0293
Dimension 18 0.0127 0.0585 -0.0567 -0.1571 -0.0422 0.1251 -0.1979 0.0433 0.1600 -0.1141 ... 0.0421 0.0573 0.0399 0.0354 0.0378 -0.0378 0.1278 0.1292 -0.0546 0.0981
Dimension 19 0.0120 0.0716 -0.0253 -0.2730 -0.0604 0.0717 -0.0706 0.0028 0.0788 -0.0572 ... 0.0827 0.0627 0.0715 0.0360 -0.0309 0.0309 0.0370 -0.0619 0.0167 -0.2158
Dimension 20 0.0118 -0.0207 0.0098 0.0749 0.0258 -0.0632 0.0972 -0.0934 0.0215 -0.0365 ... -0.0397 0.0102 -0.0095 0.0975 -0.0882 0.0882 -0.0561 0.2087 -0.0609 0.1490
Dimension 21 0.0116 0.0163 -0.0189 -0.0335 0.0217 -0.0085 -0.0065 0.0003 -0.0077 0.0387 ... -0.0194 -0.0443 -0.0256 -0.0695 0.0565 -0.0565 -0.0245 -0.0847 0.0033 0.0511
Dimension 22 0.0110 -0.0269 -0.0229 -0.4942 0.0404 0.0267 -0.0738 -0.0106 0.0890 -0.0841 ... -0.0458 0.0053 -0.0167 0.0575 -0.0083 0.0083 0.0325 0.2259 -0.0966 0.2105
Dimension 23 0.0107 0.0305 0.0057 0.2251 0.0075 0.0130 -0.0047 -0.0308 0.0152 0.0242 ... 0.0395 0.0638 0.0368 0.0805 0.0056 -0.0056 0.0164 0.1822 0.0513 -0.3359
Dimension 24 0.0103 0.0364 0.0113 -0.3118 -0.0081 0.0434 -0.0445 0.0255 -0.0354 -0.0624 ... 0.0345 -0.0196 0.0282 -0.0742 0.0119 -0.0119 0.0582 -0.1582 0.1561 -0.3458
Dimension 25 0.0100 0.0902 0.0137 0.0041 0.0457 0.0923 -0.1397 0.0712 0.0524 -0.1976 ... -0.0811 -0.0732 0.0090 -0.0643 -0.0280 0.0280 0.1073 0.1404 -0.0543 0.1782
Dimension 26 0.0099 -0.0201 0.0304 -0.0018 0.0319 -0.0093 0.0078 0.0146 -0.0086 -0.0661 ... 0.0618 0.1086 0.0160 0.2163 -0.0433 0.0433 -0.0013 0.0248 -0.0736 0.0692
Dimension 27 0.0096 -0.0798 0.0252 0.0711 -0.0079 -0.0105 0.0561 -0.0108 -0.0081 0.0673 ... 0.0082 0.0091 0.0202 -0.0335 0.0373 -0.0373 -0.0456 0.1145 0.1464 -0.4282
Dimension 28 0.0092 0.0721 -0.0446 -0.0386 0.0630 -0.0289 -0.0362 -0.1231 0.0672 0.0331 ... 0.0132 0.0043 0.0118 -0.0619 0.0411 -0.0411 -0.0211 0.0675 0.0297 -0.1847
Dimension 29 0.0088 0.0455 -0.0346 -0.2248 0.0350 0.0138 -0.0222 0.0248 -0.0329 -0.0570 ... -0.0150 -0.0994 0.1135 -0.1870 0.0138 -0.0138 -0.0078 0.1434 0.0019 -0.1188
Dimension 30 0.0086 -0.0168 -0.0083 0.0062 -0.0060 -0.1066 0.1494 -0.1039 -0.0745 0.1525 ... -0.0551 -0.0924 0.0579 -0.1122 0.0108 -0.0108 -0.0511 -0.0685 -0.0149 -0.0474
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dimension 49 0.0033 -0.0345 -0.0051 0.0371 -0.0893 -0.0668 -0.0125 -0.0184 0.3434 0.3970 ... -0.0267 -0.0833 0.0044 0.0569 -0.0251 0.0251 0.0037 0.0257 -0.0018 -0.0161
Dimension 50 0.0031 0.0038 0.0238 -0.0069 0.0068 -0.0339 -0.0133 -0.0508 0.0198 -0.0719 ... -0.1048 0.2071 -0.2025 -0.0346 -0.0173 0.0173 -0.0287 -0.0442 -0.0472 -0.0212
Dimension 51 0.0031 -0.0207 0.0332 0.0304 0.0376 -0.0492 0.0638 0.0907 0.0589 -0.0125 ... -0.0143 0.0018 -0.0773 0.0567 -0.0282 0.0282 0.0335 0.0423 0.1436 0.0323
Dimension 52 0.0029 0.0038 0.0337 -0.0422 0.0141 -0.0328 -0.1491 -0.1678 0.0961 -0.1493 ... -0.0423 -0.0057 -0.0008 0.0130 0.0228 -0.0228 -0.1436 -0.0686 -0.1988 -0.0838
Dimension 53 0.0029 -0.0028 -0.0359 -0.0100 -0.0425 0.1149 0.1468 0.1333 -0.2395 -0.2380 ... 0.0453 0.3919 0.1346 -0.3737 -0.0844 0.0844 0.0478 -0.0609 -0.1747 -0.0182
Dimension 54 0.0027 0.0135 -0.0340 0.0004 -0.0132 0.0712 -0.0445 0.0068 -0.3056 -0.1851 ... -0.0161 -0.4938 -0.0949 0.3885 0.1535 -0.1535 0.0191 -0.0857 -0.0259 0.0084
Dimension 55 0.0026 -0.0656 0.0575 0.0028 0.0666 -0.0373 -0.0122 -0.0212 0.0208 -0.1347 ... 0.0344 -0.1500 0.0598 0.0764 0.0151 -0.0151 -0.0365 -0.0380 -0.0596 -0.0213
Dimension 56 0.0026 0.0416 -0.0721 -0.0050 -0.1108 0.0807 0.1117 0.0664 -0.1471 0.0967 ... 0.0330 -0.0358 -0.0448 0.0044 -0.0156 0.0156 0.1451 0.0285 0.0287 0.0144
Dimension 57 0.0024 0.1707 -0.0861 0.0135 -0.0839 0.1380 -0.0818 0.0605 -0.1369 0.0283 ... 0.0155 0.0275 0.0763 -0.0485 -0.0370 0.0370 -0.1373 0.0231 -0.0121 0.0033
Dimension 58 0.0023 0.1133 -0.1151 -0.0006 -0.0409 0.1453 -0.1458 -0.2055 -0.1357 0.1860 ... -0.0029 -0.0671 0.2345 0.0239 -0.0309 0.0309 0.1145 0.0491 0.0298 0.0320
Dimension 59 0.0022 -0.0227 0.0133 0.0003 0.0042 -0.0140 0.0901 0.0729 -0.0135 -0.0147 ... 0.0035 -0.0477 0.1240 0.0295 0.0130 -0.0130 0.0658 0.0071 0.0160 0.0062
Dimension 60 0.0021 0.0179 -0.0146 0.0001 -0.1029 0.1139 -0.1515 -0.3066 -0.3418 0.1198 ... -0.1328 0.0480 0.1454 -0.0054 -0.0349 0.0349 0.3799 0.0975 0.0044 0.0015
Dimension 61 0.0020 -0.0161 0.0132 0.0030 -0.0352 -0.0032 0.0640 0.0931 0.0757 -0.0318 ... -0.1456 0.1304 0.0551 -0.0561 -0.0021 0.0021 -0.1240 -0.0322 -0.0129 -0.0324
Dimension 62 0.0019 0.0853 -0.0642 0.0033 -0.0071 0.1115 -0.0087 0.2364 0.1553 -0.0902 ... -0.1269 -0.0831 0.3752 0.0391 0.0297 -0.0297 -0.5259 -0.0450 0.0056 0.0035
Dimension 63 0.0018 0.1085 -0.0644 0.0206 -0.1479 0.2111 -0.3121 -0.0902 -0.1644 0.0590 ... 0.1879 0.1186 -0.3953 -0.0575 -0.0725 0.0725 -0.3926 -0.0264 -0.0070 -0.0193
Dimension 64 0.0018 0.0653 -0.0893 -0.0131 0.0779 -0.0987 0.2390 0.1146 0.1505 -0.0211 ... -0.0893 -0.0193 0.1646 -0.0113 0.0396 -0.0396 0.1814 -0.0206 0.0152 0.0072
Dimension 65 0.0017 -0.0844 0.0987 0.0124 -0.1717 0.1801 -0.1984 -0.0921 -0.1765 -0.0161 ... -0.1312 -0.0755 0.3288 0.0293 -0.0497 0.0497 -0.2389 -0.0017 0.0419 0.0184
Dimension 66 0.0016 0.0169 -0.0031 0.0061 -0.0302 0.0150 -0.0291 -0.0309 -0.0146 0.0153 ... -0.7171 0.1179 -0.2417 -0.0383 -0.0319 0.0319 -0.0243 0.0069 -0.0107 -0.0127
Dimension 67 0.0013 -0.1653 0.1220 0.0052 -0.5032 0.0932 0.3052 -0.1995 0.1050 -0.2247 ... 0.0724 -0.0670 0.1306 0.0349 0.0760 -0.0760 -0.0225 0.0171 0.0117 -0.0038
Dimension 68 0.0013 -0.0541 0.0887 -0.0037 0.3477 -0.2863 -0.3648 0.2451 -0.1497 0.1917 ... 0.0285 -0.0298 0.2457 0.0008 -0.1030 0.1030 0.0338 -0.0643 -0.0008 -0.0016
Dimension 69 0.0011 0.3231 -0.0247 -0.0033 -0.0116 -0.0309 0.0111 0.0815 0.0833 -0.0690 ... 0.0199 -0.0219 0.0340 0.0049 0.0199 -0.0199 -0.0492 0.5723 -0.0054 0.0055
Dimension 70 0.0011 -0.3483 0.1156 -0.0067 -0.3722 -0.5496 -0.2914 0.1427 -0.1095 -0.1372 ... -0.0111 0.0214 -0.0251 -0.0085 -0.0367 0.0367 -0.0484 0.1653 0.0063 -0.0148
Dimension 71 0.0010 0.2033 -0.0797 -0.0025 -0.0243 -0.0900 -0.0027 0.0309 0.0194 0.0211 ... 0.0029 -0.0061 0.0308 0.0091 -0.0049 0.0049 0.0610 -0.1263 0.0171 0.0064
Dimension 72 0.0010 -0.5029 0.1949 0.0020 0.4101 0.4226 0.0645 -0.0784 0.0110 0.0942 ... 0.0001 -0.0114 -0.0377 0.0082 0.0069 -0.0069 -0.1171 0.1830 -0.0121 -0.0027
Dimension 73 0.0005 0.3715 0.8037 0.0008 0.0356 0.0358 0.0259 -0.0224 -0.0063 -0.0256 ... 0.0009 0.0072 -0.0150 -0.0017 0.0054 -0.0054 0.0538 -0.0501 -0.0013 0.0034
Dimension 74 0.0004 0.0218 0.0726 0.0003 0.0184 0.0210 -0.0183 0.0093 0.0099 0.0017 ... 0.0006 -0.0041 0.0015 0.0001 -0.0020 0.0020 0.0603 -0.0212 -0.0063 -0.0033
Dimension 75 0.0001 -0.0015 -0.0130 0.0009 0.0074 0.0106 0.0229 0.0325 -0.0019 0.0018 ... 0.0001 -0.0005 0.0015 -0.0005 0.0031 -0.0031 -0.0024 -0.0040 -0.0001 -0.0019
Dimension 76 0.0001 -0.0024 -0.0035 -0.0047 -0.0093 0.0215 0.0163 0.0316 0.0109 0.0253 ... -0.0003 0.0004 -0.0060 0.0008 0.0021 -0.0021 0.0057 -0.0051 0.0006 0.0023
Dimension 77 0.0000 0.0030 0.0001 -0.0006 -0.0063 -0.0042 0.0146 0.0044 -0.0013 0.0015 ... 0.0001 0.0002 -0.0009 0.0001 0.0011 -0.0011 -0.0235 0.0055 -0.0011 0.0006
Dimension 78 0.0000 0.0000 0.0000 0.0000 -0.0000 0.0000 0.0000 -0.0000 0.0000 -0.0000 ... 0.0000 -0.0000 -0.0000 0.0000 0.7071 0.7071 -0.0000 -0.0000 -0.0000 0.0000

78 rows × 79 columns

In [59]:
for comp in range(3, new_azdias.shape[1]):
    pca, X_pca = doPCA(new_azdias, comp)
    comp_check = pca_results(new_azdias, pca)
    if comp_check['Explained Variance'].sum() > 0.85:
        break
        

num_comps = comp_check.shape[0]
print("Using {} components, we can explain {}% of the variability in the original data.".format(comp_check.shape[0],comp_check['Explained Variance'].sum()))
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
/opt/conda/lib/python3.6/site-packages/matplotlib/pyplot.py:523: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
pcanone
else
fit
Using 30 components, we can explain 0.8513000000000002% of the variability in the original data.
In [60]:
# Re-apply PCA to the data while selecting for number of components to retain.
# 
#From the above analysis, I will choose 30 components for my PCA

pca, X_pca = doPCA(new_azdias, 30)

X_pca = pca.fit_transform(scaler)
pcanone
else
fit
In [61]:
def scree_plot(pca):
    num_components=len(pca.explained_variance_ratio_)
    ind = np.arange(num_components)
    vals = pca.explained_variance_ratio_
 
    plt.figure(figsize=(10, 6))
    ax = plt.subplot(111)
    cumvals = np.cumsum(vals)
    ax.bar(ind, vals)
    ax.plot(ind, cumvals)
    for i in range(num_components):
        ax.annotate(r"%s%%" % ((str(vals[i]*100)[:4])), (ind[i]+0.2, vals[i]), va="bottom", ha="center", fontsize=12)
 
    ax.xaxis.set_tick_params(width=0)
    ax.yaxis.set_tick_params(width=2, length=12)
 
    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained (%)")
    plt.title('Explained Variance Per Principal Component')
    
scree_plot(pca)

Discussion 2.2: Perform Dimensionality Reduction

(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding dimensionality reduction. How many principal components / transformed features are you retaining for the next step of the analysis?)

Answer:

I tried multiple components above, and decided 30 components are good enough, I can explain more than 85% of the original data's variability, which is good enough.

Step 2.3: Interpret Principal Components

Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.

As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component is in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one tend expect to be associated with increases in the other. To contrast, features with different signs can be expected to show a negative correlation: increases in one variable should result in a decrease in the other.

  • To investigate the features, you should map each weight to their corresponding feature name, then sort the features according to weight. The most interesting features for each principal component, then, will be those at the beginning and end of the sorted list. Use the data dictionary document to help you understand these most prominent features, their relationships, and what a positive or negative value on the principal component might indicate.
  • You should investigate and interpret feature associations from the first three principal components in this substep. To help facilitate this, you should write a function that you can call at any time to print the sorted list of feature weights, for the i-th principal component. This might come in handy in the next step of the project, when you interpret the tendencies of the discovered clusters.
In [62]:
df_pca_loadings = pd.DataFrame(pca.components_)
In [63]:
ar = {}

for i in range(78):
    ar[df_pca_loadings.columns[i]] = new_azdias.columns[i]
In [64]:
df_pca_loadings.rename(columns=ar, inplace=True)
df_pca_loadings_T = pd.DataFrame(df_pca_loadings.T)
In [65]:
df_pca_loadings_T
Out[65]:
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
ALTERSKATEGORIE_GROB -0.108656 0.255844 0.066588 0.001710 0.050587 -0.042809 -0.002511 -0.014222 0.010841 -0.114215 ... 0.016599 -0.026596 0.030289 0.037325 0.084197 -0.031768 -0.081931 0.073152 0.038041 -0.016667
ANREDE_KZ 0.009166 0.077984 -0.362914 0.024321 -0.024193 0.035851 0.026265 -0.014723 0.056854 0.078753 ... -0.019219 -0.023145 0.006003 0.011844 0.014566 0.032640 0.028094 -0.041294 -0.029435 -0.003560
CJT_GESAMTTYP 0.038702 -0.122440 -0.030779 0.066718 -0.018520 0.020067 -0.030817 0.016422 -0.103008 -0.085843 ... -0.032302 -0.495944 0.220374 -0.314050 0.016602 -0.013059 0.069332 -0.035968 -0.243360 0.045615
FINANZ_MINIMALIST -0.206617 0.072395 0.147583 0.019540 -0.029818 0.075323 -0.061441 0.024503 -0.084493 0.139834 ... 0.022360 0.039743 0.005539 -0.007229 0.049849 0.028875 -0.013448 0.060110 0.040077 -0.019206
FINANZ_SPARER 0.129909 -0.232728 -0.088159 -0.007761 -0.008253 0.004424 -0.000070 0.018306 -0.047054 -0.178692 ... -0.007704 0.027291 0.014682 0.043494 0.090293 -0.025049 -0.006886 -0.024302 -0.012469 -0.099347
FINANZ_VORSORGER -0.102374 0.222510 0.083427 0.004791 0.003351 -0.130956 -0.038547 0.035320 0.087336 0.149148 ... -0.007270 -0.074217 -0.006736 -0.045089 -0.130958 0.028059 0.052694 -0.046647 0.006881 0.141239
FINANZ_ANLEGER 0.063877 -0.198551 -0.173650 -0.062211 0.126734 0.053931 0.016194 -0.019102 -0.100870 -0.118126 ... 0.002258 -0.010193 -0.029974 0.026004 0.076188 0.000856 -0.007882 -0.126330 0.001170 -0.108542
FINANZ_UNAUFFAELLIGER 0.065998 -0.223690 -0.086922 -0.001149 -0.127554 -0.058798 -0.045362 0.060143 0.041931 -0.201858 ... -0.006855 0.088763 0.015656 -0.034727 0.047206 -0.017377 -0.006310 0.074645 -0.050377 -0.057570
FINANZ_HAUSBAUER 0.153139 0.104447 -0.047195 -0.024969 0.088534 -0.043870 0.118503 -0.119986 0.177993 -0.026312 ... 0.039801 -0.084147 0.021863 -0.062974 -0.199386 -0.041889 0.066793 0.027371 -0.036042 0.163596
FINANZTYP -0.057379 0.108792 0.112096 0.023602 0.087530 0.149524 0.031470 -0.079497 -0.165002 0.156602 ... -0.011063 -0.091592 -0.036914 0.151581 0.108150 0.000390 -0.105476 -0.157965 0.085307 -0.164497
GFK_URLAUBERTYP 0.064451 -0.037260 -0.022349 -0.025008 0.025359 0.002003 0.025413 -0.024894 -0.044252 -0.067179 ... 0.053641 0.236840 -0.018998 -0.376564 0.233984 -0.022332 0.008976 0.061865 0.033370 0.002172
GREEN_AVANTGARDE -0.118003 -0.019332 0.043083 0.167315 -0.286927 0.131531 0.043082 -0.021030 0.017205 -0.087153 ... 0.081547 -0.195336 -0.146461 0.115814 -0.144518 -0.000378 -0.089370 -0.034022 -0.112487 0.098698
HEALTH_TYP 0.031185 -0.050512 -0.005306 -0.011753 -0.021684 0.012065 -0.073749 0.096203 -0.330863 -0.147993 ... 0.035934 0.101818 -0.060124 -0.114740 -0.271966 -0.059531 0.139082 0.044464 0.202149 0.111032
LP_LEBENSPHASE_FEIN -0.159113 -0.058900 -0.015053 0.313931 0.171325 -0.086820 0.007116 -0.019579 0.026968 0.038394 ... 0.002343 0.041692 -0.030361 0.021331 -0.025040 -0.001219 -0.017608 0.053343 -0.004753 0.026360
LP_LEBENSPHASE_GROB -0.144462 -0.066134 -0.018396 0.322685 0.186660 -0.098014 0.009226 -0.016096 0.027745 0.031253 ... -0.004456 0.049372 -0.024311 0.020970 -0.034258 -0.004731 -0.019193 0.052170 0.000282 0.018697
LP_FAMILIE_FEIN -0.104984 -0.092154 -0.033438 0.332553 0.205887 -0.124113 0.019288 -0.023823 0.046573 0.000533 ... -0.014740 0.033403 -0.029357 0.020085 -0.048810 -0.018839 -0.021563 0.045602 -0.010972 0.029354
LP_FAMILIE_GROB -0.106859 -0.092992 -0.034934 0.334966 0.207028 -0.122276 0.018963 -0.026316 0.045040 0.000211 ... -0.011965 0.034917 -0.029162 0.022510 -0.046537 -0.017474 -0.023242 0.046492 -0.009542 0.032491
LP_STATUS_FEIN -0.220671 -0.041236 0.016234 0.032667 -0.083627 0.130808 -0.035426 0.034382 -0.081782 0.084626 ... 0.039131 0.056987 -0.005761 0.008259 0.111450 0.039462 0.004839 0.069559 -0.002332 -0.027069
LP_STATUS_GROB -0.220252 -0.023093 0.030862 0.049311 -0.111057 0.130513 -0.037760 0.035094 -0.081546 0.084235 ... 0.044213 0.042369 -0.030992 0.024517 0.100835 0.043383 -0.006992 0.075046 -0.008918 -0.018821
NATIONALITAET_KZ 0.056920 -0.069091 -0.022599 0.014265 0.002485 -0.025804 -0.003722 0.017904 0.010265 0.111802 ... 0.845483 -0.251058 0.050512 0.008938 0.159744 0.032292 -0.035522 -0.053374 0.063143 -0.075507
RETOURTYP_BK_S -0.020402 0.162116 0.108611 0.013492 -0.009738 -0.038199 -0.021744 0.020022 -0.010553 0.185867 ... -0.019062 0.125812 -0.045573 0.102265 -0.054567 -0.021848 0.001807 -0.154319 -0.446490 -0.163416
SEMIO_SOZ 0.032731 -0.096486 0.260398 0.009745 0.004735 -0.009287 0.011629 0.023959 -0.034455 -0.081813 ... 0.092025 0.090715 -0.026453 -0.126795 -0.167477 0.038289 0.126266 0.079140 -0.023060 0.024818
SEMIO_FAM 0.065524 -0.171026 0.255115 -0.024285 -0.003307 -0.019469 0.012429 -0.011241 0.025372 0.042048 ... -0.005686 -0.001476 -0.019549 -0.033653 -0.053065 -0.048766 -0.015079 0.090449 0.010401 -0.011397
SEMIO_REL 0.092036 -0.255914 0.089170 -0.024048 -0.013279 0.020784 -0.004263 -0.002813 0.044574 0.193893 ... -0.033783 -0.000687 -0.024019 -0.002414 -0.025603 0.018231 -0.004681 -0.003489 -0.011345 0.011142
SEMIO_MAT 0.073100 -0.149077 0.056873 -0.014514 -0.039917 0.020233 0.073218 -0.124330 0.228061 0.286123 ... -0.065362 -0.037711 -0.050386 0.033032 -0.012056 -0.148436 -0.113160 0.278006 0.035217 -0.235753
SEMIO_VERT -0.034725 -0.057714 0.339663 -0.001234 0.017167 -0.025562 0.001405 -0.014440 0.012974 -0.072394 ... 0.032436 0.028313 -0.010687 -0.046082 -0.016935 -0.028283 -0.029538 0.048631 0.017137 0.027828
SEMIO_LUST -0.080013 0.176051 0.059533 0.019067 0.017705 -0.030051 0.042419 -0.029149 0.086755 -0.052887 ... 0.033581 -0.127778 -0.018757 -0.106588 -0.011419 -0.181797 -0.153097 0.424084 -0.038493 -0.275036
SEMIO_ERL -0.070148 0.224644 -0.178921 0.016242 0.036443 -0.008511 -0.010600 0.000324 -0.008271 -0.054099 ... 0.020365 -0.012506 0.001300 -0.019332 0.025919 -0.021813 -0.030094 0.066949 -0.000760 -0.075169
SEMIO_KULT 0.058737 -0.209293 0.236925 -0.023426 0.019193 -0.000283 -0.049213 0.017359 -0.034757 0.122856 ... 0.014774 0.038511 -0.043568 -0.036218 -0.061856 -0.004662 -0.011422 0.002835 0.032521 0.021550
SEMIO_RAT 0.084095 -0.172200 -0.204275 0.005726 -0.054900 0.057452 0.033934 -0.028239 0.110161 0.292327 ... -0.060869 0.006128 -0.031454 -0.006648 -0.059799 -0.003752 0.002562 0.061889 -0.015299 -0.078014
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
WOHNLAGE -0.064969 -0.044371 -0.023491 -0.168025 0.232929 -0.033969 0.105690 0.055597 0.040912 -0.013443 ... 0.048848 0.069437 0.045427 0.011525 0.010733 0.098248 0.112833 -0.108726 -0.581482 -0.000631
KBA05_ANTG1 -0.206579 -0.069231 -0.026606 -0.027121 -0.084512 0.091849 -0.068771 0.071589 -0.064238 0.070503 ... 0.015447 0.094669 0.087222 -0.057569 0.039145 -0.049996 0.054827 0.040137 -0.005818 0.136496
KBA05_ANTG2 0.011815 -0.007041 -0.011262 -0.035993 -0.057609 -0.120450 -0.216353 -0.338821 0.149161 -0.100386 ... -0.019352 -0.220797 -0.318613 -0.036554 -0.056847 -0.217576 0.055083 -0.272928 0.032930 -0.074751
KBA05_ANTG3 0.126325 0.040407 0.008713 0.008786 0.059123 -0.095323 -0.091707 -0.247488 0.075241 -0.110089 ... 0.055302 0.046117 0.067920 0.138969 0.055510 0.472551 0.023261 0.263050 0.008103 -0.082763
KBA05_ANTG4 0.134713 0.055930 0.029857 0.058335 0.059343 0.001608 0.282331 0.355838 -0.080113 0.069454 ... -0.048141 -0.028506 0.011482 -0.034755 -0.051347 -0.252665 -0.150532 -0.152713 0.012861 -0.072582
KBA05_GBZ -0.195236 -0.074099 -0.028660 -0.059228 -0.053891 0.107145 -0.131367 -0.096792 -0.007611 0.020889 ... 0.013029 0.051235 0.050807 -0.038766 0.006792 0.020015 0.125717 0.065163 -0.028153 0.146261
BALLRAUM -0.102444 -0.047357 -0.036065 -0.165521 0.168795 -0.030940 0.035757 -0.107138 0.048437 -0.067430 ... -0.004909 -0.021868 -0.083147 -0.073719 -0.088738 0.017741 -0.156923 0.028394 0.099676 -0.054679
EWDICHTE 0.161852 0.073128 0.048548 0.208123 -0.194558 0.012152 -0.111572 0.046221 -0.051724 0.059395 ... -0.025903 0.034744 0.026934 0.013460 0.034959 -0.021126 0.027000 0.006745 -0.011592 -0.017683
INNENSTADT -0.136299 -0.060136 -0.045349 -0.175995 0.139221 -0.069316 0.058495 -0.065733 0.063676 -0.072631 ... 0.017009 -0.054170 -0.059385 -0.084677 -0.048155 0.014389 -0.122275 0.028782 0.156193 0.033895
GEBAEUDETYP_RASTER -0.099402 -0.032879 -0.032141 -0.067695 0.054849 -0.008260 -0.190005 -0.098380 -0.052113 0.056334 ... 0.024422 0.056977 -0.012848 0.163619 0.230329 -0.521682 0.245924 0.100365 -0.093402 0.067019
KKK 0.042072 0.002201 -0.014538 -0.061009 0.248368 -0.069926 -0.306090 0.122173 -0.270018 0.161764 ... 0.050102 -0.029564 -0.143343 -0.021058 -0.026405 0.041031 -0.101062 -0.061722 0.041953 -0.009743
MOBI_REGIO -0.219048 -0.074977 -0.033058 -0.049124 -0.051919 0.104671 -0.103181 -0.037884 -0.033234 0.039955 ... 0.015758 0.069930 0.060790 -0.051360 0.028407 -0.018516 0.088146 0.039914 -0.018385 0.126166
ONLINE_AFFINITAET -0.051077 -0.166613 -0.056706 0.151831 0.004456 0.033634 0.009165 0.007002 -0.061765 -0.172263 ... 0.020504 -0.087277 0.037184 -0.015280 -0.057976 0.000447 0.018874 0.040469 -0.013989 0.019043
REGIOTYP 0.066040 0.017373 -0.004075 -0.032441 0.171124 -0.112498 -0.314057 0.154178 -0.237555 0.187991 ... 0.041558 -0.070828 -0.156798 -0.043732 -0.111052 0.025590 -0.097151 0.032647 0.064999 -0.058913
KBA13_ANZAHL_PKW -0.068851 -0.031255 -0.024819 -0.059677 -0.116948 -0.337994 0.276024 -0.216347 -0.304803 0.104006 ... 0.008666 0.012105 -0.004897 0.019596 0.023423 -0.017542 0.027375 -0.018344 -0.026384 0.027419
PLZ8_ANTG1 -0.194755 -0.076764 -0.049247 -0.121685 -0.017888 -0.046325 -0.019421 0.114625 -0.002363 0.027740 ... -0.016364 -0.095222 0.041932 0.049590 -0.119806 0.114499 -0.031988 0.021046 0.067701 -0.089901
PLZ8_ANTG2 0.128558 0.050312 0.029846 0.096127 -0.102948 -0.125973 -0.152588 -0.217740 0.080029 -0.057386 ... 0.012547 0.082563 -0.140819 -0.101071 0.161361 -0.195437 0.007265 -0.060497 -0.040721 0.056817
PLZ8_ANTG3 0.193389 0.077471 0.047765 0.141865 -0.038542 -0.008766 -0.009831 -0.144385 -0.026691 -0.019894 ... 0.027470 0.116089 -0.034342 -0.057183 0.133600 -0.109517 0.003153 -0.001173 -0.004865 0.084288
PLZ8_ANTG4 0.186654 0.073530 0.048503 0.138957 -0.006310 0.038482 0.097175 0.022393 -0.057687 0.006173 ... 0.016429 0.100989 0.021717 -0.032499 0.096865 -0.068236 -0.009805 0.002157 0.020887 0.093029
PLZ8_HHZ 0.033713 0.008335 0.005118 0.029651 -0.166310 -0.295991 0.267081 -0.242703 -0.362600 0.131150 ... -0.006352 0.039224 0.036697 0.005731 0.021984 0.007918 0.029151 0.007009 -0.027959 0.022815
PLZ8_GBZ -0.145179 -0.061462 -0.040789 -0.103091 -0.124816 -0.265685 0.131078 -0.191397 -0.224714 0.091286 ... -0.020092 -0.045577 0.041189 0.031089 -0.074407 0.070812 0.010027 0.014262 -0.017650 -0.055591
ARBEIT 0.124263 0.058557 0.038195 0.112653 0.030082 0.216206 -0.159823 -0.181061 -0.151138 0.093325 ... -0.044082 0.005065 0.065361 -0.022674 -0.062369 0.119327 0.015272 0.006223 -0.109321 -0.071466
ORTSGR_KLS9 0.164881 0.073238 0.049944 0.212667 -0.169923 0.071454 -0.108133 0.037212 -0.096551 0.067014 ... -0.025864 -0.016764 0.037606 0.027218 0.012275 0.012504 0.019897 0.015183 0.121934 0.042671
RELAT_AB 0.111637 0.052134 0.031815 0.118166 -0.052481 0.057178 -0.238093 -0.063898 -0.029295 0.046545 ... -0.071020 0.057979 0.084948 -0.081192 -0.036684 0.217574 -0.029202 -0.067103 -0.201768 -0.087637
OST_WEST_KZ_O 0.057117 0.032849 0.022306 0.006172 0.229756 0.397579 0.208061 -0.255807 -0.153694 0.006723 ... 0.056710 -0.008664 0.004730 0.011969 -0.030607 -0.040278 0.034277 0.041955 0.014535 0.007008
OST_WEST_KZ_W -0.057117 -0.032849 -0.022306 -0.006172 -0.229756 -0.397579 -0.208061 0.255807 0.153694 -0.006723 ... -0.056710 0.008664 -0.004730 -0.011969 0.030607 0.040278 -0.034277 -0.041955 -0.014535 -0.007008
PRAEGENDE_JUGENDJAHRE_DECADE 0.090807 -0.246587 -0.094973 0.013050 -0.026416 0.034272 -0.012950 0.031210 -0.097059 -0.181892 ... -0.023009 0.032566 0.016615 0.059724 0.104059 -0.013646 -0.046335 -0.020825 -0.016939 -0.051879
PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM 0.099108 0.039050 -0.033717 -0.160343 0.270182 -0.143545 -0.047208 0.024711 0.005690 0.111026 ... -0.082260 0.224803 0.179063 -0.158548 0.152809 -0.001280 0.109934 0.072526 0.130326 -0.089108
CAMEO_INTL_2015_WEALTH 0.183276 0.059364 0.028489 0.017259 0.127131 -0.062383 -0.063567 -0.038618 -0.012400 0.015940 ... 0.001298 -0.095745 0.053187 0.160397 -0.070248 -0.052374 0.159588 0.041584 0.008601 0.007429
CAMEO_INTL_2015_LIFE -0.111780 0.003808 -0.012334 -0.008954 -0.049371 0.085341 0.064664 -0.047207 0.018760 0.007207 ... 0.049180 0.212283 -0.336657 -0.337390 0.169566 0.056361 -0.420281 -0.191130 -0.115064 -0.035436

78 rows × 30 columns

In [66]:
# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
# HINT: Try defining a function here or in a new cell that you can reuse in the
# other cells.
# df_weight = pd.DataFrame(data=pca.components_.T, index=pca.feature_names_in_)
# df_pca_loadings_T = pd.DataFrame(pcadf_pca_loadings.T)

def map_weights(df, comp):
    figure(figsize=(20, 5), dpi=80)
    plt.xticks(rotation=90)
    plt.scatter(ar.values(), df[comp])
    plt.show()
    
map_weights(df_pca_loadings_T, 0)
In [67]:
# Map weights for the second principal component to corresponding feature names
# and then print the linked values, sorted by weight.
map_weights(df_pca_loadings_T, 1)
In [68]:
# Map weights for the third principal component to corresponding feature names
# and then print the linked values, sorted by weight.

map_weights(df_pca_loadings_T, 2)
In [69]:
# I tried 29 and 30 for comparison only
map_weights(df_pca_loadings_T, 28)
map_weights(df_pca_loadings_T, 29)
In [70]:
PCA_components = pd.DataFrame(X_pca)

plt.scatter(PCA_components[0], PCA_components[1])
Out[70]:
<matplotlib.collections.PathCollection at 0x7fe2a5dddc18>

Discussion 2.3: Interpret Principal Components

(Double-click this cell and replace this text with your own text, reporting your observations from detailed investigation of the first few principal components generated. Can we interpret positive and negative values from them in a meaningful way?)

ANSWER:

Yes, FINANZ_MINIMALIST is always positive for example But over all, I notice my columns change values significantly on each conponent

Type of people in the first component: Based on the some of the highest values in this component. Type of people being reflected in this component are affected by family houses in the PLZ8 region (deducted from columns PLZ8_ANTG3 and PLZ8_ANTG4), and by estimated household net income (deducted from column HH_EINKOMMEN_SCORE)

Type of people in the second component: Type of people being reflected in this component are affected by estimated age based on given name analysis (deducted from column ALTERSKATEGORIE_GROB), and financial typology (deducted from column FINANZ_VORSORGER)

Type of people in the third component: Type of people being reflected in this component are affected by personality typology (deducted from columns SEMIO_VERT, SEMIO_SOZ, and SEMIO_FAM)

Step 3: Clustering

Step 3.1: Apply Clustering to General Population

You've assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, you will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.

  • Use sklearn's KMeans class to perform k-means clustering on the PCA-transformed data.
  • Then, compute the average difference from each point to its assigned cluster's center. Hint: The KMeans object's .score() method might be useful here, but note that in sklearn, scores tend to be defined so that larger is better. Try applying it to a small, toy dataset, or use an internet search to help your understanding.
  • Perform the above two steps for a number of different cluster counts. You can then see how the average distance decreases with an increasing number of clusters. However, each additional cluster provides a smaller net benefit. Use this fact to select a final number of clusters in which to group the data. Warning: because of the large size of the dataset, it can take a long time for the algorithm to resolve. The more clusters to fit, the longer the algorithm will take. You should test for cluster counts through at least 10 clusters to get the full picture, but you shouldn't need to test for a number of clusters above about 30.
  • Once you've selected a final number of clusters to use, re-fit a KMeans instance to perform the clustering operation. Make sure that you also obtain the cluster assignments for the general demographics data, since you'll be using them in the final Step 3.3.
In [71]:
# Over a number of different cluster counts...
from sklearn.cluster import KMeans

# I tried n_clusters but the higher I go, the longer it takes, so I will stick with few tries below this cell

# for i in range(1,55):
#     kmeans = KMeans(n_clusters=i, random_state=42).fit(X_pca)
#     if kmeans.score(X_pca) >= 0:
#     print(kmeans.score(pca))



    # compute the average within-cluster distances.
    
In [72]:
# Investigate the change in within-cluster distance across number of clusters.
# HINT: Use matplotlib's plot function to visualize this relationship.
from sklearn.cluster import MiniBatchKMeans

n = range(1,100)

score = []

for i in n:
    print(i)
    kmeans = MiniBatchKMeans(n_clusters=i, random_state=42).fit(X_pca)
    score.append(kmeans.inertia_)
print("DONE")
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
DONE
In [73]:
plt.plot(n, score)
plt.xlabel('number of clusters, k')
plt.ylabel('score')
plt.show()

# #I woul say 25 clusters seems to be the elbow point 
In [74]:
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.

kmeans = KMeans(n_clusters=25, random_state=42).fit(X_pca)

Discussion 3.1: Apply Clustering to General Population

(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding clustering. Into how many clusters have you decided to segment the population?)

Answer:

I decided to segment the population with 25 clusters, I think this is the elbow point (it's a bit hard to tell, since the curve is too smooth for the naked eye) I also made sure ti have random_state = 42 for your review.

cluster 12 is over presented, and cluster 20 is under presented

Step 3.2: Apply All Steps to the Customer Data

Now that you have clusters and cluster centers for the general population, it's time to see how the customer data maps on to those clusters. Take care to not confuse this for re-fitting all of the models to the customer data. Instead, you're going to use the fits from the general population to clean, transform, and cluster the customer data. In the last step of the project, you will interpret how the general population fits apply to the customer data.

  • Don't forget when loading in the customers data, that it is semicolon (;) delimited.
  • Apply the same feature wrangling, selection, and engineering steps to the customer demographics using the clean_data() function you created earlier. (You can assume that the customer demographics data has similar meaning behind missing data patterns as the general demographics data.)
  • Use the sklearn objects from the general demographics data, and apply their transformations to the customers data. That is, you should not be using a .fit() or .fit_transform() method to re-fit the old objects, nor should you be creating new sklearn objects! Carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.
In [75]:
# Load in the customer demographics data.
customers = pd.read_csv('Udacity_CUSTOMERS_Subset.csv', sep=';')

customers
Out[75]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2 4 1 5.0 5 1 5 1 2 2 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 1.0 2.0 1.0
1 -1 4 1 NaN 5 1 5 1 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 -1 4 2 2.0 5 1 5 1 4 4 ... 2.0 3.0 3.0 1.0 3.0 3.0 2.0 3.0 5.0 3.0
3 1 4 1 2.0 5 1 5 2 1 2 ... 3.0 2.0 1.0 0.0 1.0 3.0 4.0 1.0 3.0 1.0
4 -1 3 1 6.0 3 1 4 4 5 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 3.0 5.0 1.0
5 1 3 1 4.0 5 1 5 1 2 3 ... 2.0 3.0 2.0 1.0 1.0 5.0 5.0 3.0 7.0 5.0
6 2 4 1 2.0 5 1 5 1 1 2 ... 3.0 2.0 1.0 0.0 1.0 5.0 5.0 2.0 3.0 2.0
7 1 4 1 2.0 5 1 5 1 2 2 ... 3.0 3.0 1.0 1.0 1.0 3.0 3.0 3.0 4.0 3.0
8 2 4 2 1.0 2 2 5 1 1 5 ... 1.0 4.0 2.0 1.0 5.0 5.0 4.0 3.0 8.0 3.0
9 1 3 1 3.0 5 2 4 1 3 1 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 3.0 6.0 4.0
10 -1 3 2 5.0 4 2 4 4 3 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 1 4 1 3.0 5 1 5 1 3 2 ... 4.0 2.0 1.0 0.0 1.0 3.0 3.0 2.0 5.0 1.0
12 -1 4 1 5.0 5 2 4 3 2 1 ... 3.0 1.0 0.0 0.0 1.0 4.0 5.0 1.0 1.0 1.0
13 -1 3 1 6.0 5 2 4 2 4 1 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 4.0 5.0 5.0
14 2 4 2 2.0 3 1 5 1 2 5 ... 1.0 4.0 3.0 1.0 3.0 5.0 3.0 3.0 8.0 5.0
15 2 3 1 3.0 5 1 5 1 1 2 ... 3.0 2.0 1.0 0.0 1.0 5.0 5.0 3.0 6.0 1.0
16 1 4 1 2.0 5 1 5 1 3 2 ... 4.0 1.0 0.0 0.0 1.0 3.0 4.0 3.0 5.0 3.0
17 -1 4 1 5.0 4 3 1 4 5 1 ... 2.0 3.0 2.0 1.0 1.0 5.0 5.0 1.0 3.0 1.0
18 -1 2 1 4.0 2 4 2 2 3 3 ... 1.0 3.0 3.0 2.0 5.0 3.0 1.0 3.0 8.0 5.0
19 -1 4 2 4.0 3 2 4 3 3 3 ... 4.0 2.0 1.0 0.0 1.0 3.0 4.0 3.0 5.0 5.0
20 -1 4 1 6.0 5 3 4 2 3 1 ... 4.0 1.0 0.0 0.0 1.0 3.0 4.0 4.0 6.0 5.0
21 2 4 1 1.0 3 1 5 2 1 4 ... 1.0 4.0 3.0 1.0 3.0 4.0 3.0 1.0 3.0 1.0
22 2 1 2 2.0 5 1 5 1 1 2 ... 3.0 2.0 0.0 0.0 1.0 4.0 5.0 1.0 2.0 1.0
23 -1 3 1 6.0 3 3 4 1 2 2 ... 1.0 3.0 3.0 2.0 5.0 5.0 2.0 3.0 9.0 5.0
24 2 4 2 2.0 4 1 5 2 2 3 ... 3.0 3.0 1.0 0.0 1.0 3.0 3.0 2.0 6.0 3.0
25 -1 3 2 3.0 5 2 4 4 2 1 ... 3.0 2.0 1.0 0.0 1.0 3.0 4.0 1.0 3.0 1.0
26 1 4 2 2.0 5 1 5 2 1 3 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 1.0 2.0 1.0
27 -1 3 1 3.0 5 2 3 1 2 1 ... 2.0 4.0 2.0 1.0 2.0 4.0 3.0 3.0 7.0 4.0
28 2 4 2 2.0 3 1 5 1 1 5 ... 1.0 3.0 3.0 2.0 4.0 5.0 3.0 3.0 7.0 5.0
29 0 3 2 4.0 3 1 5 1 2 5 ... 1.0 4.0 3.0 2.0 5.0 3.0 2.0 3.0 8.0 5.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
191622 2 4 1 2.0 5 1 5 1 1 2 ... 3.0 3.0 1.0 0.0 1.0 3.0 3.0 2.0 2.0 2.0
191623 1 4 1 2.0 5 1 5 1 3 2 ... 2.0 4.0 2.0 1.0 2.0 4.0 3.0 2.0 4.0 3.0
191624 1 4 1 2.0 5 1 5 1 4 2 ... 2.0 3.0 2.0 1.0 1.0 4.0 4.0 2.0 5.0 1.0
191625 2 4 2 4.0 5 1 5 2 1 2 ... 3.0 2.0 1.0 1.0 1.0 5.0 5.0 2.0 4.0 3.0
191626 -1 2 1 4.0 3 3 3 1 3 2 ... 2.0 3.0 2.0 1.0 1.0 4.0 4.0 4.0 8.0 5.0
191627 3 3 2 2.0 2 1 5 1 2 5 ... 1.0 3.0 2.0 2.0 4.0 5.0 3.0 3.0 7.0 3.0
191628 1 4 1 2.0 5 1 5 1 1 2 ... 3.0 2.0 1.0 1.0 1.0 5.0 5.0 2.0 4.0 3.0
191629 2 4 2 1.0 2 1 5 1 2 5 ... 2.0 4.0 2.0 1.0 5.0 3.0 2.0 2.0 5.0 1.0
191630 2 3 1 2.0 5 1 5 1 1 2 ... 2.0 3.0 1.0 1.0 1.0 5.0 5.0 4.0 7.0 5.0
191631 2 4 2 1.0 2 1 5 1 2 5 ... 2.0 4.0 2.0 1.0 2.0 4.0 3.0 4.0 7.0 5.0
191632 0 3 1 5.0 4 1 5 1 1 4 ... 4.0 2.0 0.0 0.0 1.0 3.0 4.0 3.0 1.0 3.0
191633 1 4 1 3.0 5 1 5 1 4 2 ... 4.0 2.0 0.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
191634 2 3 1 1.0 5 1 5 1 2 3 ... 3.0 3.0 0.0 0.0 1.0 3.0 4.0 1.0 2.0 1.0
191635 2 4 2 2.0 5 1 5 2 1 3 ... 3.0 3.0 1.0 1.0 1.0 3.0 3.0 3.0 5.0 3.0
191636 2 4 1 1.0 5 1 5 1 1 2 ... 3.0 2.0 1.0 0.0 1.0 3.0 3.0 4.0 5.0 4.0
191637 -1 3 2 6.0 5 2 4 2 3 1 ... 2.0 3.0 1.0 0.0 1.0 3.0 3.0 3.0 7.0 5.0
191638 1 4 1 6.0 3 1 5 1 1 4 ... 0.0 3.0 3.0 2.0 5.0 4.0 1.0 3.0 7.0 3.0
191639 -1 3 1 5.0 4 3 3 2 4 1 ... 4.0 1.0 0.0 0.0 1.0 2.0 3.0 4.0 5.0 4.0
191640 3 3 1 4.0 4 1 5 1 2 4 ... 2.0 3.0 2.0 1.0 1.0 5.0 5.0 3.0 7.0 4.0
191641 1 4 1 2.0 5 1 5 1 2 2 ... 4.0 2.0 1.0 0.0 1.0 4.0 5.0 2.0 1.0 2.0
191642 2 4 2 2.0 2 1 5 2 1 5 ... 0.0 0.0 0.0 2.0 5.0 3.0 1.0 4.0 9.0 3.0
191643 2 4 1 5.0 5 1 5 1 3 2 ... 4.0 2.0 0.0 0.0 1.0 5.0 5.0 2.0 4.0 1.0
191644 2 4 2 6.0 2 1 5 1 2 5 ... 1.0 4.0 3.0 1.0 5.0 4.0 3.0 3.0 5.0 5.0
191645 2 4 1 5.0 5 1 5 1 3 2 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 5.0 3.0
191646 3 2 2 2.0 2 1 5 1 2 5 ... 1.0 4.0 2.0 0.0 5.0 1.0 1.0 3.0 9.0 5.0
191647 1 3 1 4.0 5 1 5 1 1 2 ... 2.0 4.0 2.0 1.0 2.0 5.0 4.0 3.0 8.0 5.0
191648 -1 4 2 2.0 5 1 5 2 2 3 ... NaN NaN NaN NaN NaN NaN NaN 1.0 4.0 1.0
191649 2 4 1 2.0 5 1 5 1 1 2 ... 3.0 2.0 2.0 1.0 1.0 5.0 5.0 3.0 7.0 5.0
191650 3 3 2 4.0 2 1 5 1 2 5 ... 3.0 2.0 1.0 1.0 1.0 2.0 3.0 3.0 4.0 4.0
191651 3 2 1 2.0 5 1 5 1 1 2 ... 3.0 2.0 0.0 0.0 1.0 4.0 5.0 1.0 3.0 1.0

191652 rows × 85 columns

In [76]:
# Apply preprocessing, feature transformation, and clustering from the general
# demographics onto the customer data, obtaining cluster predictions for the
# customer demographics data.
customers = clean_data(customers)
customers = customers.dropna(thresh=78)

# customers
In [91]:
customers_scaler = scaler_fit.transform(customers.astype(float))
customers
Out[91]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O OST_WEST_KZ_W PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE
0 4.0 1 5.0 5 1 5 1 2 2 2 ... 5.0 1.0 2.0 1.0 0 1 50.0 False 1.0 3.0
2 4.0 2 2.0 5 1 5 1 4 4 2 ... 2.0 3.0 5.0 3.0 0 1 50.0 False 3.0 4.0
4 3.0 1 6.0 3 1 4 4 5 2 2 ... 3.0 3.0 5.0 1.0 0 1 70.0 True 4.0 1.0
5 3.0 1 4.0 5 1 5 1 2 3 5 ... 5.0 3.0 7.0 5.0 0 1 50.0 False 3.0 4.0
6 4.0 1 2.0 5 1 5 1 1 2 2 ... 5.0 2.0 3.0 2.0 0 1 50.0 False 2.0 3.0
7 4.0 1 2.0 5 1 5 1 2 2 5 ... 3.0 3.0 4.0 3.0 0 1 50.0 False 1.0 5.0
9 3.0 1 3.0 5 2 4 1 3 1 2 ... 5.0 3.0 6.0 4.0 0 1 70.0 False 1.0 5.0
11 4.0 1 3.0 5 1 5 1 3 2 2 ... 3.0 2.0 5.0 1.0 0 1 50.0 False 1.0 4.0
12 4.0 1 5.0 5 2 4 3 2 1 6 ... 5.0 1.0 1.0 1.0 0 1 70.0 True 2.0 2.0
14 4.0 2 2.0 3 1 5 1 2 5 5 ... 3.0 3.0 8.0 5.0 0 1 40.0 True 5.0 1.0
15 3.0 1 3.0 5 1 5 1 1 2 5 ... 5.0 3.0 6.0 1.0 0 1 60.0 True 1.0 5.0
16 4.0 1 2.0 5 1 5 1 3 2 5 ... 4.0 3.0 5.0 3.0 0 1 50.0 False 1.0 5.0
17 4.0 1 5.0 4 3 1 4 5 1 3 ... 5.0 1.0 3.0 1.0 0 1 90.0 False 1.0 4.0
18 2.0 1 4.0 2 4 2 2 3 3 1 ... 1.0 3.0 8.0 5.0 0 1 90.0 True 4.0 1.0
19 4.0 2 4.0 3 2 4 3 3 3 2 ... 4.0 3.0 5.0 5.0 0 1 70.0 True 2.0 4.0
21 4.0 1 1.0 3 1 5 2 1 4 6 ... 3.0 1.0 3.0 1.0 0 1 60.0 True 4.0 3.0
22 1.0 2 2.0 5 1 5 1 1 2 2 ... 5.0 1.0 2.0 1.0 0 1 40.0 True 2.0 4.0
23 3.0 1 6.0 3 3 4 1 2 2 5 ... 2.0 3.0 9.0 5.0 0 1 70.0 True 5.0 1.0
24 4.0 2 2.0 4 1 5 2 2 3 2 ... 3.0 2.0 6.0 3.0 0 1 50.0 True 3.0 3.0
25 3.0 2 3.0 5 2 4 4 2 1 6 ... 4.0 1.0 3.0 1.0 0 1 80.0 False 2.0 2.0
26 4.0 2 2.0 5 1 5 2 1 3 2 ... 5.0 1.0 2.0 1.0 0 1 50.0 True 2.0 5.0
27 3.0 1 3.0 5 2 3 1 2 1 5 ... 3.0 3.0 7.0 4.0 0 1 70.0 False 4.0 4.0
28 4.0 2 2.0 3 1 5 1 1 5 6 ... 3.0 3.0 7.0 5.0 0 1 40.0 True 5.0 4.0
29 3.0 2 4.0 3 1 5 1 2 5 5 ... 2.0 3.0 8.0 5.0 0 1 60.0 True 5.0 1.0
30 3.0 2 2.0 4 1 5 1 2 2 2 ... 5.0 2.0 3.0 3.0 0 1 60.0 True 3.0 2.0
31 3.0 1 2.0 5 2 3 3 2 1 6 ... 5.0 2.0 1.0 2.0 0 1 70.0 False 2.0 4.0
32 3.0 2 5.0 2 1 5 1 1 5 6 ... 2.0 4.0 7.0 3.0 1 0 70.0 True 5.0 1.0
33 4.0 1 2.0 5 1 5 1 2 3 2 ... 5.0 3.0 2.0 5.0 0 1 40.0 False 2.0 5.0
35 4.0 1 1.0 5 1 5 1 3 1 2 ... 3.0 3.0 4.0 2.0 0 1 60.0 True 1.0 4.0
41 4.0 2 2.0 4 1 5 1 2 3 2 ... 3.0 3.0 4.0 3.0 0 1 50.0 False 2.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
191619 4.0 1 4.0 4 1 5 1 1 5 5 ... 4.0 3.0 6.0 2.0 0 1 40.0 False 1.0 4.0
191620 4.0 2 2.0 2 1 5 1 1 4 5 ... 3.0 4.0 6.0 3.0 0 1 60.0 True 4.0 4.0
191621 4.0 2 2.0 3 1 5 2 1 5 6 ... 4.0 4.0 6.0 5.0 0 1 60.0 True 4.0 1.0
191622 4.0 1 2.0 5 1 5 1 1 2 2 ... 3.0 2.0 2.0 2.0 0 1 40.0 True 4.0 3.0
191623 4.0 1 2.0 5 1 5 1 3 2 5 ... 3.0 2.0 4.0 3.0 0 1 60.0 True 1.0 4.0
191624 4.0 1 2.0 5 1 5 1 4 2 5 ... 4.0 2.0 5.0 1.0 0 1 70.0 False 5.0 4.0
191625 4.0 2 4.0 5 1 5 2 1 2 2 ... 5.0 2.0 4.0 3.0 0 1 50.0 False 2.0 5.0
191626 2.0 1 4.0 3 3 3 1 3 2 5 ... 4.0 4.0 8.0 5.0 0 1 70.0 True 2.0 4.0
191627 3.0 2 2.0 2 1 5 1 2 5 2 ... 3.0 3.0 7.0 3.0 0 1 50.0 True 5.0 5.0
191628 4.0 1 2.0 5 1 5 1 1 2 5 ... 5.0 2.0 4.0 3.0 0 1 60.0 True 2.0 2.0
191629 4.0 2 1.0 2 1 5 1 2 5 5 ... 2.0 2.0 5.0 1.0 0 1 40.0 True 5.0 5.0
191630 3.0 1 2.0 5 1 5 1 1 2 5 ... 5.0 4.0 7.0 5.0 0 1 50.0 False 5.0 1.0
191632 3.0 1 5.0 4 1 5 1 1 4 5 ... 4.0 3.0 1.0 3.0 0 1 70.0 True 1.0 4.0
191633 4.0 1 3.0 5 1 5 1 4 2 5 ... 4.0 3.0 5.0 2.0 0 1 60.0 True 1.0 4.0
191634 3.0 1 1.0 5 1 5 1 2 3 2 ... 4.0 1.0 2.0 1.0 0 1 60.0 True 3.0 3.0
191635 4.0 2 2.0 5 1 5 2 1 3 2 ... 3.0 3.0 5.0 3.0 0 1 40.0 True 4.0 5.0
191636 4.0 1 1.0 5 1 5 1 1 2 5 ... 3.0 4.0 5.0 4.0 0 1 40.0 False 3.0 4.0
191637 3.0 2 6.0 5 2 4 2 3 1 6 ... 3.0 3.0 7.0 5.0 0 1 70.0 False 2.0 5.0
191638 4.0 1 6.0 3 1 5 1 1 4 5 ... 1.0 3.0 7.0 3.0 0 1 60.0 True 5.0 2.0
191640 3.0 1 4.0 4 1 5 1 2 4 5 ... 5.0 3.0 7.0 4.0 0 1 70.0 False 1.0 5.0
191641 4.0 1 2.0 5 1 5 1 2 2 2 ... 5.0 2.0 1.0 2.0 0 1 50.0 False 1.0 4.0
191642 4.0 2 2.0 2 1 5 2 1 5 6 ... 1.0 4.0 9.0 3.0 1 0 50.0 True 5.0 5.0
191643 4.0 1 5.0 5 1 5 1 3 2 5 ... 5.0 2.0 4.0 1.0 0 1 60.0 True 1.0 5.0
191644 4.0 2 6.0 2 1 5 1 2 5 5 ... 3.0 3.0 5.0 5.0 0 1 70.0 False 4.0 1.0
191645 4.0 1 5.0 5 1 5 1 3 2 5 ... 3.0 3.0 5.0 3.0 0 1 70.0 True 4.0 3.0
191646 2.0 2 2.0 2 1 5 1 2 5 5 ... 1.0 3.0 9.0 5.0 0 1 70.0 True 1.0 4.0
191647 3.0 1 4.0 5 1 5 1 1 2 5 ... 4.0 3.0 8.0 5.0 0 1 50.0 False 1.0 4.0
191649 4.0 1 2.0 5 1 5 1 1 2 5 ... 5.0 3.0 7.0 5.0 0 1 50.0 False 2.0 4.0
191650 3.0 2 4.0 2 1 5 1 2 5 2 ... 3.0 3.0 4.0 4.0 0 1 70.0 True 2.0 4.0
191651 2.0 1 2.0 5 1 5 1 1 2 6 ... 5.0 1.0 3.0 1.0 0 1 50.0 True 3.0 3.0

115643 rows × 78 columns

In [92]:
customer_transform = pca.transform(customers_scaler)
customers_predict = kmeans.predict(customer_transform)

Step 3.3: Compare Customer Data to Demographics Data

At this point, you have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, you will compare the two cluster distributions to see where the strongest customer base for the company is.

Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If we think the company's customer base to be universal, then the cluster assignment proportions should be fairly similar between the two. If there are only particular segments of the population that are interested in the company's products, then we should see a mismatch from one to the other. If there is a higher proportion of persons in a cluster for the customer data compared to the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid) then that suggests the people in that cluster to be a target audience for the company. On the other hand, the proportion of the data in a cluster being larger in the general population than the customer data (e.g. only 2% of customers closest to a population centroid that captures 6% of the data) suggests that group of persons to be outside of the target demographics.

Take a look at the following points in this step:

  • Compute the proportion of data points in each cluster for the general population and the customer data. Visualizations will be useful here: both for the individual dataset proportions, but also to visualize the ratios in cluster representation between groups. Seaborn's countplot() or barplot() function could be handy.
    • Recall the analysis you performed in step 1.1.3 of the project, where you separated out certain data points from the dataset if they had more than a specified threshold of missing values. If you found that this group was qualitatively different from the main bulk of the data, you should treat this as an additional data cluster in this analysis. Make sure that you account for the number of data points in this subset, for both the general population and customer datasets, when making your computations!
  • Which cluster or clusters are overrepresented in the customer dataset compared to the general population? Select at least one such cluster and infer what kind of people might be represented by that cluster. Use the principal component interpretations from step 2.3 or look at additional components to help you make this inference. Alternatively, you can use the .inverse_transform() method of the PCA and StandardScaler objects to transform centroids back to the original data space and interpret the retrieved values directly.
  • Perform a similar investigation for the underrepresented clusters. Which cluster or clusters are underrepresented in the customer dataset compared to the general population, and what kinds of people are typified by these clusters?
In [93]:
# Compare the proportion of data in each cluster for the customer data to the
# proportion of data in each cluster for the general population.

# First few clusters seems to have more
sns.barplot(data=kmeans.cluster_centers_)
Out[93]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe2a458be48>
In [94]:
customers_predict = pd.DataFrame(customers_predict)
# customer_pca = pd.DataFrame(pca.transform(customers_predict))
In [95]:
customer_transform = pd.DataFrame(customer_transform)
In [96]:
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
# map_weights(customer_pca,0)


figure(figsize=(15, 35), dpi=80)

for cluster in range(30):
    plt.subplot(15, 2, cluster+1) 
    plt.scatter(range(115643), customer_transform[cluster])
plt.show()
In [97]:
oo = pd.DataFrame(X_pca)
oo
Out[97]:
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 3.949997 -3.013148 -3.236146 0.478502 0.514132 -2.306559 2.794105 0.811305 -0.714605 -0.998052 ... -0.596794 -0.085042 0.857023 -1.358757 -1.641817 -0.322442 -0.220226 0.004213 0.209369 0.624098
1 0.105194 0.337246 -3.299623 -0.941872 -3.246496 0.140206 0.469969 -0.736975 0.501846 -2.047131 ... 0.227457 -0.469077 -0.853872 0.186317 -0.157310 -0.469687 -0.142685 -0.359988 0.169718 1.628511
2 -0.338343 -0.320505 2.903046 3.728714 1.507268 -1.545791 -3.154437 -0.554980 -0.489044 -1.242942 ... -0.223128 0.021010 -0.222510 -0.217555 0.057762 -1.065326 1.012935 -0.729736 -1.271023 0.386642
3 -0.239080 -1.207499 -0.885322 -3.899993 -0.157209 -2.305500 0.329099 -0.823080 -0.450346 3.343934 ... -0.953149 -0.445040 -0.542938 0.425630 -0.998041 -0.272813 0.397450 -0.471995 -0.745013 1.387893
4 1.313761 -2.987892 -3.407476 -2.721598 -1.109844 -0.630054 -0.459891 -1.127291 -0.503651 2.224992 ... -1.075320 -0.390202 0.636285 -0.262523 1.171886 -1.319311 0.220315 0.363186 -1.217884 -0.860152
5 -0.310616 -2.590188 3.950976 -1.908772 -2.830632 -0.540085 1.010331 -0.604339 2.258646 1.594789 ... -1.221001 0.724148 0.606766 -0.305716 1.318258 -0.210865 1.077003 -1.882438 0.678523 -0.572615
6 -1.846836 0.261628 1.225233 -0.168426 -3.298825 1.925985 0.399782 1.160184 -0.495193 -2.171843 ... 0.675479 0.064149 -1.711470 0.753533 0.812566 -0.430989 -0.062060 1.258877 -0.369112 1.316448
7 -0.012484 0.917599 -3.758976 -1.176802 -2.932953 0.575826 1.522742 -0.071179 1.737901 -4.536572 ... 0.947597 0.875566 -1.742501 0.305806 -0.161824 -0.681743 -1.293598 -0.160724 -2.408668 -0.467985
8 2.143714 3.968448 -0.936030 -1.859025 1.119619 -1.205742 -2.782969 -1.058216 0.993085 0.986428 ... -0.314468 -0.663922 -1.076137 0.456637 -0.856000 -1.163610 0.798445 -0.033860 -0.784331 -1.566698
9 -1.997339 0.652021 2.731827 1.545468 -4.357918 0.336781 0.111420 -1.375281 -0.776869 -0.764724 ... 0.155201 0.675928 -1.570123 -0.247608 -0.469567 -0.518286 -1.158051 0.218737 0.468004 -1.411885
10 3.205311 -2.948698 3.773564 0.192072 -1.895402 0.229426 -2.298657 -0.677231 -1.104434 1.059875 ... -0.535190 1.739075 -0.819979 -1.783292 1.689534 0.157629 -0.830063 -0.153094 -0.728768 -0.694804
11 4.920649 -4.016969 2.119548 -2.079268 -0.352483 -1.271818 -1.991103 -0.235011 0.032108 -1.084064 ... -0.392590 -0.227101 -0.600143 0.472055 -1.342909 -0.621734 0.506137 -0.152769 -0.341111 -0.114447
12 0.416129 -0.392235 1.547887 3.777251 -2.154239 0.041147 -0.390521 -0.339423 -2.134471 -2.216419 ... -0.559246 -0.308608 0.234253 2.188292 -0.433219 0.107054 -0.452433 0.317880 0.123398 1.542362
13 -0.948889 -0.260814 1.480375 -3.915282 -0.183846 -2.724664 -0.208437 -0.251901 -2.039616 0.139258 ... -0.113113 0.010585 -0.747028 0.737980 -0.648339 -1.028550 -1.069157 0.803668 -0.331032 0.844022
14 6.917557 -0.963716 -2.251227 0.609952 -1.868550 -0.215608 -1.876595 -0.818806 1.776320 1.222191 ... -1.321691 -1.511386 0.973651 0.430069 0.281986 1.411803 0.516815 0.148599 0.489559 -0.223326
15 -2.895523 -5.163762 3.951514 -0.428787 -2.363333 1.823851 -0.208927 -0.067016 1.170407 -0.378748 ... -0.344663 0.237030 0.266812 -1.066559 1.306437 1.191034 0.015807 -1.072384 -1.126388 -0.620960
16 2.042330 -0.134151 -2.082199 4.598148 -1.405942 -1.367074 0.389236 -1.385916 1.959462 1.822730 ... -0.110193 -1.121100 1.167177 0.401808 0.077373 -0.454324 0.557396 0.210583 0.590188 -0.687763
17 2.932494 3.323687 -2.158152 0.041518 -0.633034 -0.037277 -1.461242 0.618634 -0.454359 0.128389 ... -0.584762 0.507057 1.054767 1.260941 0.143917 2.362746 0.886110 -0.445487 0.158079 -0.666980
18 -5.947243 0.944581 1.879224 1.369807 2.172663 -0.999358 -0.651607 0.566651 0.351928 0.533579 ... -0.080662 0.710160 -0.067462 0.016062 0.115664 -0.543778 -0.147926 0.461738 0.353802 -1.504279
19 -1.479810 -3.722241 -3.612974 -2.735628 -0.882829 1.106300 -1.204438 0.533939 0.503482 2.461820 ... 0.830879 -1.756510 0.102410 -0.188824 1.138861 -0.231780 -1.325890 0.066884 1.293352 -0.335783
20 -2.665494 -4.423918 -4.339253 3.430962 -2.442061 0.052160 0.829209 0.384740 -0.332118 -0.500150 ... -0.626310 0.562291 -0.355641 -0.337123 -0.747100 0.341853 -0.098396 -0.387132 -0.417418 -0.391705
21 3.044484 -2.666307 -4.329059 -2.003486 -2.346415 0.009283 0.190717 -0.496739 0.923753 -1.190749 ... -0.229886 0.887005 -0.647417 -0.421913 -0.958186 0.321089 -0.792853 -0.095878 1.073571 0.800260
22 -2.164006 0.622952 -2.598277 -3.582275 1.438658 0.254666 -0.274503 1.634778 0.932352 -1.335557 ... 0.693641 0.353877 -0.876055 -0.658144 -0.477300 0.331345 0.060630 -1.866046 0.641657 -0.042526
23 -6.434412 1.316708 -2.347813 1.119031 3.223051 -2.805069 0.539145 -1.160911 -0.990382 0.728901 ... -0.418842 0.231876 0.236968 0.386064 -0.536878 -0.039249 1.186091 0.449383 -0.225742 1.324710
24 -2.254230 4.432085 -0.838220 -0.758392 -2.658798 1.029060 0.162855 -0.437339 1.813997 -0.307855 ... 0.592081 0.379504 -0.980383 -1.134777 -0.057384 -1.023852 -0.410287 -0.702155 -0.185200 0.778944
25 0.518522 -4.163555 1.824621 -2.222696 -1.095873 -0.961367 0.491120 -1.462010 0.143855 -0.335937 ... 1.487236 -1.144664 0.739934 -0.744413 1.212257 0.724807 0.954258 0.864025 1.691202 -2.325453
26 -0.667844 -2.066782 0.097888 -3.877922 0.742926 -0.752172 -1.010716 1.501767 -0.518927 -2.315415 ... 0.077491 -0.807342 0.354083 -0.100719 0.159586 -1.974149 0.547024 1.025504 -1.816926 0.553144
27 -1.941812 -6.901071 2.125991 0.004548 -0.822106 -0.221912 1.025442 -0.508373 1.749730 -1.487379 ... -0.369781 -0.018850 0.834831 1.389035 0.290865 0.057223 1.758865 -1.619690 -2.087890 -0.096301
28 5.683261 -0.274508 -2.783472 1.569293 -2.571234 -0.398993 0.979689 0.395008 -0.120693 -1.155215 ... -0.584361 2.536782 0.674071 1.319797 -0.263191 0.236025 0.538794 0.886950 0.096870 -0.128067
29 3.341149 2.936977 -2.251512 0.270906 -1.737287 -1.916488 -1.174362 0.037535 -1.411083 0.656104 ... 1.541078 -0.429229 0.463386 -0.075340 -0.635914 -0.968667 2.165534 -2.421432 -0.100449 0.742548
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
623179 -4.805886 -0.997439 2.045499 4.812685 -1.060512 0.983081 -0.689051 0.516972 0.328985 1.027129 ... -0.433316 0.011758 0.110395 0.391093 -0.413342 0.174972 0.177472 0.531711 0.404652 -0.552305
623180 -1.977272 -5.236508 3.523499 4.281505 -1.807004 1.071697 -1.623605 0.425518 1.110567 0.926977 ... -0.345072 0.020743 -0.592099 -0.328972 0.429846 0.283601 0.158535 -1.054620 -0.700362 -0.718361
623181 -1.032683 -3.977072 -2.348450 -3.410722 0.124881 1.620894 -1.026453 1.737276 1.567230 -0.478750 ... 0.260543 0.500355 -1.432800 1.038377 -0.674595 0.027433 -0.862005 0.289863 -0.986190 -0.807814
623182 -0.282474 0.088427 -1.991714 0.013014 -4.155650 2.059461 -1.364881 0.562925 0.055202 2.032752 ... 0.134282 0.703447 -0.954986 0.016817 -1.595061 0.457650 -1.207923 0.011531 -0.703820 -0.310057
623183 -4.034190 1.262412 2.880761 -0.191788 -3.453343 0.284367 -0.684884 0.579654 -1.438758 -0.034213 ... -0.403519 0.160607 -1.160838 0.009492 -0.114619 0.626067 -0.207747 -1.488978 0.990424 0.126836
623184 -2.726459 3.809833 -1.542061 -2.448503 -1.329626 -1.975012 1.184950 0.201286 0.359474 -0.006497 ... 0.440413 0.931542 0.178429 -0.300041 -0.671610 -0.745211 0.746071 0.703801 -1.148906 0.092834
623185 1.843866 -3.634608 -3.652924 -0.344172 2.537629 -1.666139 -0.536766 0.082779 2.437381 0.823674 ... -1.028685 -1.145868 -0.312081 0.567382 -0.622509 2.831687 -1.068728 -0.066496 1.005889 -0.021059
623186 0.452517 2.142572 3.211819 -0.281477 1.049264 -0.318030 -2.820168 -0.362523 2.578561 0.741411 ... 0.062889 -0.176351 -0.357315 0.198716 -0.381669 -0.478197 0.713760 0.098017 -0.924797 0.313582
623187 -2.931319 -1.417819 2.027466 2.020411 -1.768589 0.953904 -2.040144 0.637089 -3.250414 -0.691680 ... -1.347785 -1.015744 -0.937706 -0.425792 -0.105611 0.631766 -1.075965 0.266646 0.006824 1.377374
623188 -0.387170 4.120025 -1.749964 3.010654 2.010200 -2.326395 -0.737502 -0.819271 -0.576684 -1.612696 ... 0.272187 -0.446833 0.644608 -1.280603 -0.459852 1.584508 -0.248426 1.113095 -0.397363 0.154607
623189 2.323461 -4.104053 -3.805605 0.989226 1.649745 -0.550362 -0.772020 -0.249540 1.571979 0.551886 ... 0.775899 -0.684549 0.361615 -0.446052 1.490204 0.956008 -1.276170 1.477934 -0.102551 -0.933625
623190 -1.102604 2.967614 -2.078838 1.593295 -3.907586 0.995944 -1.078867 -0.026647 -0.440625 0.144348 ... -0.085416 -0.074528 -0.267491 1.221833 1.218156 0.409748 -1.257468 0.447493 -0.299972 -0.109926
623191 0.306081 -1.665135 0.161282 -0.285878 -4.539199 0.279690 -0.868346 0.570197 -1.685779 -3.167898 ... -0.274588 -0.884391 0.561742 -0.903199 -0.070939 1.359252 -0.218803 -0.471779 -0.258862 0.726190
623192 5.475941 -2.034855 2.269396 -1.043436 -1.365498 -1.663489 0.281433 -1.683598 1.076400 0.106570 ... 1.469561 0.667051 0.024851 -0.703954 0.263111 -0.465861 1.208104 0.128245 0.491623 0.430811
623193 3.967287 -3.732733 -2.614898 3.836484 1.689078 -1.230795 -1.614582 0.605349 0.582644 0.363018 ... 1.194550 0.987791 -0.328780 -1.212375 -0.194115 -1.562854 1.368557 -0.540574 -0.205461 0.219819
623194 2.396171 0.851819 1.806362 -0.132925 -0.910536 -0.384357 -1.136829 -0.056185 -1.109493 -1.014012 ... -0.437727 0.010906 0.052119 -0.300754 -0.006086 -0.366058 0.323106 -0.801605 0.835612 0.278650
623195 6.581110 -0.572199 -1.979466 1.380068 -2.590663 0.684009 1.622745 1.927194 1.226343 -0.816718 ... -0.286908 1.452226 -0.182014 -0.732314 -0.411965 1.660700 1.377704 -0.061396 -0.323563 0.395816
623196 -3.228725 -2.086527 1.859887 2.924615 1.468702 -0.723304 -1.654605 0.491350 -1.943498 -0.909155 ... -0.604749 0.176918 0.165221 -0.763540 1.474471 -0.105870 0.693876 -1.402234 0.874488 -1.015509
623197 0.762496 2.050676 2.568008 2.394055 1.682646 -2.213585 -2.132886 -0.770521 1.889562 0.316797 ... -0.129389 1.199827 -1.237763 -0.788068 0.138123 -0.988669 -0.075420 -1.613372 -0.623262 0.477402
623198 -1.385344 -3.931129 -3.266310 -2.556040 -1.756259 0.437774 0.255037 -0.198084 -1.302950 0.355747 ... -1.185086 0.327269 0.537034 -0.550215 2.194894 -0.389397 -1.011644 0.336262 1.027271 0.478216
623199 -2.608662 3.374415 -2.064094 0.474328 -1.690004 -0.164667 0.127553 0.666293 1.145767 1.222547 ... 0.887531 0.208439 -0.456372 0.416389 -0.253654 -0.627460 0.744136 -0.004459 1.277339 1.997755
623200 3.745814 -2.411621 -3.349815 2.005522 0.745111 -2.946645 -0.024248 -2.454608 0.540620 -0.776872 ... 2.735525 -1.285712 0.517676 -0.831996 -0.462469 1.209983 -0.259009 1.133507 -0.050587 -0.147128
623201 3.976701 1.445332 2.162383 0.483628 -1.115523 -0.187945 -0.358673 3.330443 -1.881179 -0.592328 ... -0.237711 0.444869 0.121966 1.578872 -1.851622 -0.144590 0.075060 -1.380095 -0.214696 -1.121540
623202 2.524139 1.332626 1.041822 -2.226888 0.670485 -1.159363 -3.294666 -0.637504 1.501865 0.703321 ... -0.547390 -0.340505 -0.671116 1.102302 -0.325859 -0.658836 -0.518428 0.875179 0.385678 0.732265
623203 3.653874 5.551246 4.379990 0.564691 -0.151747 -0.616247 0.146997 1.343542 1.867253 0.840406 ... -0.137285 0.568099 0.669531 0.507003 -0.606988 0.206441 -0.255892 -0.346242 0.666179 -0.233578
623204 -3.366527 -1.040649 -2.470758 5.347448 0.573390 -2.356476 0.654496 -1.736436 -1.052681 -0.066652 ... -0.231799 -1.112834 -0.261645 0.568953 1.610026 0.270050 -1.414568 -0.956756 -0.736399 0.728940
623205 3.845887 -0.867727 -2.847747 -0.130896 -2.889372 0.270811 0.389234 -0.925095 0.482924 -1.035531 ... -0.502609 1.657613 0.328134 -1.835625 0.483756 -1.322782 1.077318 -0.211484 -0.789880 1.151815
623206 4.437725 -2.371959 -3.580663 -1.370023 -1.418419 -2.144084 -0.001043 -1.411015 1.223933 0.603835 ... 0.940346 -1.776359 0.449213 -0.801183 1.420195 0.357965 -0.296901 1.039077 1.214028 -0.712326
623207 5.050301 -4.128519 2.207496 2.339435 2.656376 -0.831063 -3.607131 0.072440 2.618152 -0.738423 ... -0.971049 0.196644 -0.274640 0.488693 0.829695 0.088572 -0.184682 0.017016 -1.527631 0.221203
623208 0.171784 1.773576 2.975977 -3.574843 1.154342 -1.886434 -0.298012 -0.610091 0.609777 -0.069524 ... 0.604622 1.053930 -1.490906 0.348634 -1.607911 -0.055219 -0.618665 -0.819311 1.881874 -0.260067

623209 rows × 30 columns

In [98]:
print("If the cluster is True, this cluster is over presented.")
for c in customer_transform:
    print(f"cluster {c} over presented? {customer_transform[c].mean() < X_pca[c].mean()}")
If the cluster is True, this cluster is over presented.
cluster 0 over presented? True
cluster 1 over presented? False
cluster 2 over presented? False
cluster 3 over presented? False
cluster 4 over presented? True
cluster 5 over presented? True
cluster 6 over presented? False
cluster 7 over presented? False
cluster 8 over presented? False
cluster 9 over presented? False
cluster 10 over presented? False
cluster 11 over presented? False
cluster 12 over presented? False
cluster 13 over presented? False
cluster 14 over presented? True
cluster 15 over presented? False
cluster 16 over presented? True
cluster 17 over presented? True
cluster 18 over presented? True
cluster 19 over presented? False
cluster 20 over presented? False
cluster 21 over presented? False
cluster 22 over presented? False
cluster 23 over presented? False
cluster 24 over presented? False
cluster 25 over presented? True
cluster 26 over presented? False
cluster 27 over presented? False
cluster 28 over presented? True
cluster 29 over presented? False
Answer:
By the looks of it, cluster 0, 4 and 5 (and other True results) are over presented, clusters like 1 and 3 are under presented
In [99]:
customers.iloc[115638]
Out[99]:
ALTERSKATEGORIE_GROB                      2
ANREDE_KZ                                 2
CJT_GESAMTTYP                             2
FINANZ_MINIMALIST                         2
FINANZ_SPARER                             1
FINANZ_VORSORGER                          5
FINANZ_ANLEGER                            1
FINANZ_UNAUFFAELLIGER                     2
FINANZ_HAUSBAUER                          5
FINANZTYP                                 5
GFK_URLAUBERTYP                          11
GREEN_AVANTGARDE                          0
HEALTH_TYP                                2
LP_LEBENSPHASE_FEIN                       5
LP_LEBENSPHASE_GROB                       2
LP_FAMILIE_FEIN                           1
LP_FAMILIE_GROB                           1
LP_STATUS_FEIN                            1
LP_STATUS_GROB                            1
NATIONALITAET_KZ                          1
RETOURTYP_BK_S                            5
SEMIO_SOZ                                 2
SEMIO_FAM                                 5
SEMIO_REL                                 7
SEMIO_MAT                                 3
SEMIO_VERT                                3
SEMIO_LUST                                2
SEMIO_ERL                                 5
SEMIO_KULT                                6
SEMIO_RAT                                 7
                                       ... 
WOHNLAGE                                  4
KBA05_ANTG1                               1
KBA05_ANTG2                               4
KBA05_ANTG3                               1
KBA05_ANTG4                               0
KBA05_GBZ                                 3
BALLRAUM                                  1
EWDICHTE                                  6
INNENSTADT                                2
GEBAEUDETYP_RASTER                        3
KKK                                       3
MOBI_REGIO                                3
ONLINE_AFFINITAET                         1
REGIOTYP                                  7
KBA13_ANZAHL_PKW                         68
PLZ8_ANTG1                                1
PLZ8_ANTG2                                4
PLZ8_ANTG3                                2
PLZ8_ANTG4                                0
PLZ8_HHZ                                  1
PLZ8_GBZ                                  1
ARBEIT                                    3
ORTSGR_KLS9                               9
RELAT_AB                                  5
OST_WEST_KZ_O                             0
OST_WEST_KZ_W                             1
PRAEGENDE_JUGENDJAHRE_DECADE             70
PRAEGENDE_JUGENDJAHRE_IS_MAINSTREAM    True
CAMEO_INTL_2015_WEALTH                    1
CAMEO_INTL_2015_LIFE                      4
Name: 191646, Length: 78, dtype: object
PRAEGENDE_JUGENDJAHRE_DECADE and KBA13_ANZAHL_PKW seem to have a big impact.

Discussion 3.3: Compare Customer Data to Demographics Data

(Double-click this cell and replace this text with your own text, reporting findings and conclusions from the clustering analysis. Can we describe segments of the population that are relatively popular with the mail-order company, or relatively unpopular with the company?)

Answer:

From my observation, in the first component, column KBA13_ANZAHL_PKW makes a huge difference for the population, people with high value in this coumn are relatively popular with the mail-order company PRAEGENDE_JUJENDJAHRE_DECADE is the same thing for the third component.

Congratulations on making it this far in the project! Before you finish, make sure to check through the entire notebook from top to bottom to make sure that your analysis follows a logical flow and all of your findings are documented in Discussion cells. Once you've checked over all of your work, you should export the notebook as an HTML document to submit for evaluation. You can do this from the menu, navigating to File -> Download as -> HTML (.html). You will submit both that document and this notebook for your project submission.

In [ ]: